1. 09 January 2009 (10 commits)
    • memcg: swap cgroup for remembering usage · 27a7faa0
      Committed by KAMEZAWA Hiroyuki
      For accounting swap, we need a record per swap entry, at least.
      
      This patch adds the following functions:
        - swap_cgroup_swapon() .... called from swapon
        - swap_cgroup_swapoff() ... called at the end of swapoff.
      
        - swap_cgroup_record() .... record information of swap entry.
        - swap_cgroup_lookup() .... lookup information of swap entry.
      
      This patch only implements "how to record the information"; it provides
      no actual method for limiting swap usage.  These routines use a flat
      table for recording and lookup.  A "wise" lookup structure such as a
      radix-tree would require memory allocation when inserting new records,
      but swap-out is usually invoked under memory shortage (or when memcg
      hits its limit), so static allocation is used.  (Dynamic allocation is
      probably not very hard, but it would add another memory allocation on
      the memory-shortage path.)  A user-space sketch of the idea follows
      below.

      Note 1: A pointer is used to record the information, which means 8 bytes
            per swap entry.  This could be reduced if we introduce a cgroup id
            in the range 0-65535 or 0-255.
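      A minimal user-space model of the flat-table idea (illustrative only:
      the function names mirror the ones listed above, but the kernel
      implementation differs, e.g. it allocates the table per swap type at
      swapon time and uses vmalloc):

      #include <stdio.h>
      #include <stdlib.h>

      struct mem_cgroup;                       /* opaque owner, stands in for a memcg */

      static struct mem_cgroup **swap_record;  /* one pointer (8 bytes) per swap entry */
      static unsigned long swap_max;

      static int swap_cgroup_swapon(unsigned long max_entries)
      {
              /* preallocate everything here, so swap-out never needs to allocate */
              swap_record = calloc(max_entries, sizeof(*swap_record));
              if (!swap_record)
                      return -1;
              swap_max = max_entries;
              return 0;
      }

      static void swap_cgroup_swapoff(void)
      {
              free(swap_record);
              swap_record = NULL;
              swap_max = 0;
      }

      /* remember which memcg owns the swap entry at 'offset'; return the old owner */
      static struct mem_cgroup *swap_cgroup_record(unsigned long offset,
                                                   struct mem_cgroup *mem)
      {
              struct mem_cgroup *old = swap_record[offset];

              swap_record[offset] = mem;       /* O(1), no allocation on this path */
              return old;
      }

      static struct mem_cgroup *swap_cgroup_lookup(unsigned long offset)
      {
              return swap_record[offset];
      }

      int main(void)
      {
              struct mem_cgroup *owner = (struct mem_cgroup *)0x1;  /* dummy cookie */

              if (swap_cgroup_swapon(1024))
                      return 1;
              swap_cgroup_record(42, owner);
              printf("entry 42 owned by %p\n", (void *)swap_cgroup_lookup(42));
              swap_cgroup_swapoff();
              return 0;
      }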
      Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reported-by: Hugh Dickins <hugh@veritas.com>
      Reported-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27a7faa0
    • memcg: mem+swap controller Kconfig · c077719b
      Committed by KAMEZAWA Hiroyuki
      Config and control variable for mem+swap controller.
      
      This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
      (memory resource controller swap extension).

      For swap accounting we obviously have to use additional memory to
      remember "who uses swap", which adds overhead, so it is better to offer
      users a choice.

      This patch therefore adds two ways to enable or disable the swap
      extension (see the sketch below):
        - a CONFIG option
        - a boot option
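      A kernel-style sketch of the boot-option side (hedged: the option and
      variable names here are illustrative, not necessarily the exact ones
      this patch adds):

      #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
      /* swap accounting compiled in, but it can still be disabled at boot */
      int do_swap_account __read_mostly = 1;

      static int __init disable_swap_account(char *s)
      {
              do_swap_account = 0;   /* e.g. "noswapaccount" on the kernel command line */
              return 1;
      }
      __setup("noswapaccount", disable_swap_account);
      #endif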
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c077719b
    • memcg: handle swap caches · d13d1443
      Committed by KAMEZAWA Hiroyuki
      SwapCache support for the memory resource controller (memcg).

      Before the mem+swap controller can be added, memcg itself should handle
      SwapCache properly; this part has been split out of that work.

      In the current memcg, SwapCache is simply not accounted, so a user can
      create tons of SwapCache.  This is an accounting leak and should be
      handled.
      
      SwapCache accounting is done as follows:
      
        charge (anon)
      	- charged when it's mapped.
      	  (because of readahead, charge at add_to_swap_cache() is not sane)
        uncharge (anon)
      	- uncharged when it's dropped from swapcache and fully unmapped.
      	  means it's not uncharged at unmap.
      	  Note: delete from swap cache at swap-in is done after rmap information
      	        is established.
        charge (shmem)
      	- charged at swap-in. this prevents charge at add_to_page_cache().
      
        uncharge (shmem)
      	- uncharged when it's dropped from swapcache and not on shmem's
      	  radix-tree.
      
        At migration, the check against the 'old page' is modified to handle shmem.
      
      Compared with the old version that was discussed (and caused trouble),
      we now have the advantages of:
        - the PCG_USED bit.
        - simple migration handling.

      So the situation is much easier than it was several months ago.
      
      [hugh@veritas.com: memcg: handle swap caches build fix]
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d13d1443
    • memcg: new force_empty to free pages under group · c1e862c1
      Committed by KAMEZAWA Hiroyuki
      With memcg-move-all-accounts-to-parent-at-rmdir.patch there is no longer
      a leak of memory usage at rmdir, and the old force_empty was removed.

      This patch adds "force_empty" again, in a more reasonable manner.
      
      The memory.force_empty file is triggered by

        # echo 0 (or any value) > memory.force_empty

      and behaves as follows (see the sketch after this list):

        1. It only works when there are no tasks in this cgroup.
        2. It frees as many pages under this cgroup as possible.
        3. Pages which cannot be freed are moved up to the parent.
        4. As a result, the memcg is empty once the echo above returns.

      This is much better behaviour than the old "force_empty", which simply
      forgot all accounting.  The patch also checks signal_pending(), so the
      "echo" above can be stopped with Ctrl-C.
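      A sketch of the intended loop (simplified; cgroup_has_tasks() and
      free_or_move_to_parent() are illustrative placeholders, not the real
      helpers):

      static int mem_cgroup_force_empty(struct mem_cgroup *mem)
      {
              if (cgroup_has_tasks(mem))              /* 1. refuse while tasks remain */
                      return -EBUSY;

              while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
                      if (signal_pending(current))    /* the "echo" can be stopped by Ctrl-C */
                              return -EINTR;
                      /* 2./3. reclaim what we can, move the rest up to the parent */
                      free_or_move_to_parent(mem);
              }
              return 0;                               /* 4. the memcg is empty on return */
      }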
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1e862c1
    • memcg: reduce size of mem_cgroup by using nr_cpu_ids · c8dad2bb
      Committed by Jan Blunck
      As Jan Blunck <jblunck@suse.de> pointed out, sizing memcg's per-cpu
      statistics allocation by NR_CPUS is wasteful.

      This patch bases mem_cgroup's cpustat allocation on nr_cpu_ids instead
      of NR_CPUS, as sketched below.
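      Illustration of the change (struct cpu_stat stands in for the real
      per-cpu statistics member; the exact structure layout in mem_cgroup
      differs):

      /* before: sized for the compile-time maximum, even with few possible CPUs */
      mem = kzalloc(sizeof(*mem) + NR_CPUS * sizeof(struct cpu_stat), GFP_KERNEL);

      /* after: sized for nr_cpu_ids, i.e. the highest possible CPU id + 1 on this boot */
      mem = kzalloc(sizeof(*mem) + nr_cpu_ids * sizeof(struct cpu_stat), GFP_KERNEL);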
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8dad2bb
    • memcg: move all accounting to parent at rmdir() · f817ed48
      Committed by KAMEZAWA Hiroyuki
      This patch provides a function to move the accounting information of a
      page between mem_cgroups, and rewrites force_empty to make use of it.

      The page_cgroup is moved with:
       - the lru_lock of the source/destination mem_cgroup held, and
       - lock_page_cgroup() held.

      Therefore, a routine which touches pc->mem_cgroup without holding
      lock_page_cgroup() must confirm that pc->mem_cgroup is still valid.
      Typical code looks like this:
      
      (while page is not under lock_page())
      	mem = pc->mem_cgroup;
      	mz = page_cgroup_zoneinfo(pc);
      	spin_lock_irqsave(&mz->lru_lock, flags);
      	if (pc->mem_cgroup == mem)
      		... /* some list handling */
      	spin_unlock_irqrestore(&mz->lru_lock, flags);
      
      Of course, the better way is:
      	lock_page_cgroup(pc);
      	....
      	unlock_page_cgroup(pc);

      But then you must mind the lock nesting and avoid deadlock.

      If you handle page_cgroups on a mem_cgroup's LRU under mz->lru_lock,
      you do not have to worry about what pc->mem_cgroup points to.
      Moved pages are added to the head of the LRU, not to the tail.
      
      The expected users of this routine are:
        - force_empty (rmdir)
        - moving tasks between cgroups (to move accounting information)
        - hierarchy support (maybe useful)

      force_empty (rmdir) uses this move_account and moves pages to the
      parent.  This "move" will not cause OOM (an "oom" parameter was added
      to try_charge()).

      If the parent is busy (not enough memory), force_empty calls
      try_to_free_page() and reduces usage.

      The purpose of this behaviour is
        - to fix the "forget all" behaviour of force_empty and avoid leaking
          accounting, and
        - to keep pages in memory as long as possible, by "moving first,
          freeing only if necessary".

      Adding a switch to change the behaviour of force_empty to
        - free first, move if necessary, or
        - free everything and return -EBUSY if there are mlocked/busy pages
      is under consideration.  (I'll add it if someone requests it.)

      This patch also removes the memory.force_empty file, a brutal
      debug-only interface.
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f817ed48
    • memcg: do not recalculate section unnecessarily in init_section_page_cgroup · 0753b0ef
      Committed by Fernando Luis Vazquez Cao
      In init_section_page_cgroup() the section a given pfn belongs to is
      calculated at the top of the function and, despite the fact that the
      pfn/section correspondence does not change, it is recalculated further
      down the same function.  By computing this just once and reusing that
      value we save some bytes in the object file and do not waste CPU cycles.
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0753b0ef
    • memcg: simple migration handling · 01b1ae63
      Committed by KAMEZAWA Hiroyuki
      Currently, "charge" management under page migration is done in the
      following manner (assume page contents migrate from oldpage to newpage):

       before
        - "newpage" is charged before migration.
       on success
        - "oldpage" is uncharged somewhere (unmap, radix-tree replace).
       on failure
        - "newpage" is uncharged.
        - "oldpage" is charged again if necessary (*1).

      But (*1) is not reliable, because it relies on GFP_ATOMIC.
      
      This patch changes the behaviour to the following, using
      charge/commit/cancel operations (see the sketch below):

       before
        - charge PAGE_SIZE (no target page yet).
       on success
        - commit the charge against "newpage".
       on failure
        - commit the charge against "oldpage"
          (the PCG_USED bit effectively avoids double-counting).
        - if "oldpage" is obsolete, cancel the PAGE_SIZE charge.
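      A sketch of how a migration caller applies the protocol.  The names
      memcg_charge/commit/cancel, do_the_actual_migration() and page_in_use()
      are illustrative placeholders, not the functions this patch adds:

      static int migrate_one_page(struct mem_cgroup *memcg,
                                  struct page *oldpage, struct page *newpage)
      {
              int rc;

              memcg_charge(memcg, PAGE_SIZE);          /* before: no target page yet */

              rc = do_the_actual_migration(oldpage, newpage);

              if (!rc)
                      memcg_commit(memcg, newpage);    /* success: account the new page */
              else if (page_in_use(oldpage))
                      memcg_commit(memcg, oldpage);    /* failure: PCG_USED avoids double count */
              else
                      memcg_cancel(memcg, PAGE_SIZE);  /* oldpage obsolete: drop the charge */
              return rc;
      }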
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01b1ae63
    • memcg: fix gfp_mask of callers of charge · bced0520
      Committed by KAMEZAWA Hiroyuki
      Fix misuse of GFP_KERNEL.

      Currently, most callers of the mem_cgroup_charge_xxx functions use
      GFP_KERNEL.

      I think this comes from the fact that page_cgroup *was* dynamically
      allocated.

      But now we allocate all page_cgroups at boot, and
      mem_cgroup_try_to_free_pages() reclaims memory with GFP_HIGHUSER_MOVABLE
      masked by the specified GFP_RECLAIM_MASK.

        * This is because we just want to reduce memory usage.
          "Where should we reclaim from?" is not a problem for memcg.

      This patch modifies the gfp masks to be GFP_HIGHUSER_MOVABLE where
      possible.

      Note: this patch is not a behaviour fix; it is about showing sane
            information in the source code.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bced0520
    • memcg: introduce charge-commit-cancel style of functions · 7a81b88c
      Committed by KAMEZAWA Hiroyuki
      There is a small race in do_swap_page().  When the swapped-in page is
      charged, its mapcount can be greater than 0.  But at the same time some
      other process (which shares the page) may unmap it, taking the mapcount
      from 1 to 0, so the page gets uncharged:

            CPUA                          CPUB
             mapcount == 1.
         (1) charge if mapcount==0        zap_pte_range()
                                          (2) mapcount 1 => 0.
                                          (3) uncharge(). (success)
         (4) set page's rmap()
             mapcount 0 => 1

      Then this swap page's accounting is leaked.
      
      To fix this, a new interface is added:
        - charge
         account PAGE_SIZE to the res_counter and try to free pages if necessary.
        - commit
         register the page_cgroup and add it to the LRU if necessary.
        - cancel
         uncharge PAGE_SIZE because do_swap_page() failed.

           CPUA
        (1) charge (always)
        (2) set the page's rmap (mapcount > 0)
        (3) commit the charge (whether it turned out to be necessary or not)
            after set_pte().

      This protocol uses the PCG_USED bit on page_cgroup to avoid
      over-accounting.  The usual mem_cgroup_charge_common() does
      charge -> commit in a single step.
      
      The patch also adds the following functions to clarify all charge sites:

        - mem_cgroup_newpage_charge() .... replacement for mem_cgroup_charge(),
          called for newly allocated anonymous pages.

        - mem_cgroup_charge_migrate_fixup()
          called only from remove_migration_ptes(); we'll have to rewrite this
          later (this patch just keeps the old behaviour).  This function will
          be removed by an additional patch, to make migration clearer.

      Good for clarifying "what we do".

      We then have the following four charge points (a usage sketch follows
      this list):
        - newpage
        - swap-in
        - add-to-cache
        - migration
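      A sketch of the swap-in protocol described above.  The names
      memcg_try_charge/commit_charge/cancel_charge and map_the_page() are
      illustrative stand-ins, not the exact helpers this patch introduces:

      /* illustrative sketch of the do_swap_page() protocol, not the kernel code */
      static int swap_in_fault(struct mm_struct *mm, struct page *page, pte_t *pte)
      {
              struct mem_cgroup *memcg;

              /* (1) always charge first: reserve PAGE_SIZE, reclaim if needed */
              if (memcg_try_charge(mm, GFP_KERNEL, &memcg))
                      return -ENOMEM;

              if (map_the_page(page, pte)) {           /* a racing unmap lost the page */
                      memcg_cancel_charge(memcg);      /* cancel: give PAGE_SIZE back */
                      return -EAGAIN;
              }

              /* (2) rmap is set, mapcount > 0 */
              memcg_commit_charge(memcg, page);        /* (3) commit; PCG_USED stops double
                                                          accounting if already charged */
              return 0;
      }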
      
      [akpm@linux-foundation.org: add missing inline directives to stubs]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a81b88c
  2. 07 January 2009 (30 commits)
    • Remove obsolete CONFIG_RESOURCES_64BIT · 67faaada
      Committed by Geert Uytterhoeven
      commit 8308c54d ("generic: redefine
      resource_size_t as phys_addr_t") made CONFIG_RESOURCES_64BIT obsolete, but
      didn't remove it. Remove it.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67faaada
    • mm: hugetlb: remove redundant `if' operation · 91f47662
      Committed by Cyrill Gorcunov
      At this point we already know that 'addr' is not NULL, so get rid of the
      redundant 'if'.  gcc probably eliminates it in an optimization pass anyway.
      
      [akpm@linux-foundation.org: use __weak, too]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91f47662
    • mm: stop kswapd's infinite loop at high order allocation · 73ce02e9
      Committed by KOSAKI Motohiro
      Wassim Dagash reported the following kswapd infinite-loop problem:

        kswapd runs in an infinite loop trying to swap until order-10 of zone
        highmem is OK ... kswapd will keep trying to balance order-10 of zone
        highmem forever (or until someone releases a very large chunk of
        highmem).

      For non-order-0 allocations, the system may never be balanced due to
      fragmentation, but kswapd should not loop forever as a result.

      Instead, recheck all watermarks at order-0, as they are the most
      important ones.  If those watermarks are OK, kswapd goes back to sleep
      (see the sketch below).
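      A sketch of the idea at the end of the kswapd balancing loop (simplified
      and abridged; zone_watermark_ok() is the real helper, the surrounding
      structure and field names are approximations of that era's code):

      /*
       * We failed to balance the node for a high-order allocation.  Before
       * looping again, check only the order-0 watermarks: if those are fine,
       * give up on the high order and let kswapd go back to sleep.
       */
      if (sc.order > 0 && !all_zones_ok) {
              int order0_ok = 1;
              int i;

              for (i = 0; i < pgdat->nr_zones; i++) {
                      struct zone *zone = pgdat->node_zones + i;

                      if (!zone_watermark_ok(zone, 0, zone->pages_high, 0, 0))
                              order0_ok = 0;
              }
              if (order0_ok)
                      sc.order = 0;   /* stop chasing the unreachable high order */
      }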
      
      [akpm@linux-foundation.org: fix comment]
      Reported-by: wassim dagash <wassim.dagash@gmail.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73ce02e9
    • bootmem: print request details before BUG_ON(them) · 594fe1a0
      Committed by Johannes Weiner
      Moving the request details print-out before the sanity checks that
      might panic() enables us to analyse invalid requests without having
      access to the line information of the stack dump.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      594fe1a0
    • mm: check for no mmaps in exit_mmap() · dcd4a049
      Committed by Johannes Weiner
      When dup_mmap() OOMs we can end up with mm->mmap == NULL.  The error
      path then does mmput(), and unmap_vmas() gets a NULL vma which it
      dereferences.

      In exit_mmap() there is nothing to do at all for this case, so we can
      bail out of the call path right there (see the sketch below).
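      The guard itself is tiny; a sketch of its placement (the rest of
      exit_mmap() is elided):

      void exit_mmap(struct mm_struct *mm)
      {
              struct vm_area_struct *vma = mm->mmap;

              /* Can happen if dup_mmap() received an OOM: nothing was mapped */
              if (!vma)
                      return;

              /* ... normal unmap_vmas()/free_pgtables() path follows ... */
      }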
      
      [akpm@linux-foundation.org: add sorely-needed comment]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcd4a049
    • mm: kill page_queue_congested() · 084f71ae
      Committed by KOSAKI Motohiro
      page_queue_congested() was introduced in 2002, but it was never used.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      084f71ae
    • mm: remove CONFIG_OUT_OF_LINE_PFN_TO_PAGE · 9f572e3f
      Committed by KOSAKI Motohiro
      No architectures use CONFIG_OUT_OF_LINE_PFN_TO_PAGE - it can be removed.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f572e3f
    • mm: introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting · 901608d9
      Committed by Oleg Nesterov
      xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses
      mm->hiwater_xxx directly, this leads to 2 problems:
      
      - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any
        moment before the task exits, so we should check the current values of
        rss/vm anyway.
      
      - do_exit()->update_hiwater_xxx() calls are racy.  An exiting thread can
        be preempted right before mm->hiwater_xxx = new_val, and another thread
        can use A_LOT of memory and exit in between.  When the first thread
        resumes it can be the last thread in the thread group, in that case we
        report the wrong hiwater_xxx values which do not take A_LOT into
        account.
      
      Introduce the get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and
      change xacct_add_tsk() to use them (a sketch of the helpers follows).
      The first helper will also be used by rusage->ru_maxrss accounting.

      Kill the do_exit()->update_hiwater_xxx() calls.  Unless we are going to
      decrease rss/vm there is no point in updating mm->hiwater_xxx, and nobody
      can look at this mm_struct once exit_mmap() actually unmaps the memory.
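      A sketch of the helpers in the obvious form (the real ones are simple
      max() wrappers in the mm headers; this rendering is illustrative):

      static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
      {
              /* the recorded high-water mark may be stale: never report less
                 than what the task is using right now */
              return max(mm->hiwater_rss, get_mm_rss(mm));
      }

      static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
      {
              return max(mm->hiwater_vm, mm->total_vm);
      }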
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      901608d9
    • mm: pagecache gfp flags fix · 67d58ac4
      Committed by Nick Piggin
      Frustratingly, gfp_t is really divided into two classes of flags.  One are
      the context dependent ones (can we sleep?  can we enter filesystem?  block
      subsystem?  should we use some extra reserves, etc.).  The other ones are
      the type of memory required and depend on how the algorithm is implemented
      rather than the point at which the memory is allocated (highmem?  dma
      memory?  etc).
      
      Some of the functions which allocate a page and add it to page cache take
      a gfp_t, but sometimes those functions or their callers aren't really
      doing the right thing: when allocating pagecache page, the memory type
      should be mapping_gfp_mask(mapping).  When allocating radix tree nodes,
      the memory type should be kernel mapped (not highmem) memory.  The gfp_t
      argument should only really be needed for context dependent options.
      
      This patch doesn't really solve that tangle in a nice way, but it does
      attempt to fix a couple of bugs.
      
      - find_or_create_page changes its radix-tree allocation to include only
        the main context-dependent flags, so that the pagecache page may be
        allocated from arbitrary types of memory without affecting the
        radix-tree.  In practice, slab allocations don't come from highmem
        anyway, and the radix-tree only uses slab allocations, so there is no
        practical change (unless some fs uses GFP_DMA for pages).
      
      - grab_cache_page_nowait() is changed to allocate radix-tree nodes with
        GFP_NOFS, because it is not supposed to reenter the filesystem.  This
        bug could cause lock recursion if a filesystem is not expecting the
        function to reenter the fs (as-per documentation).
      
      Filesystems should be careful about exactly what semantics they want and
      what they get when fiddling with gfp_t masks to allocate pagecache.  One
      should be as liberal as possible with the type of memory that can be
      used, and the same goes for the context-specific flags.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67d58ac4
    • mm: direct IO starvation improvement · 48b47c56
      Committed by Nick Piggin
      Direct IO can invalidate and sync a lot of pagecache pages in the mapping.
       A 4K direct IO will actually try to sync and/or invalidate the pagecache
      of the entire file, for example (which might be many GB or TB large).
      
      Improve this by doing range syncs.  Also, memory no longer has to be
      unmapped to catch the dirty bits for syncing, as dirty bits would remain
      coherent due to dirty mmap accounting.
      
      This fixes the immediate DM deadlocks when doing direct IO reads to block
      device with a mounted filesystem, if only by papering over the problem
      somewhat rather than addressing the fsync starvation cases.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48b47c56
    • mm/mmap.c: fix coding style · 48aae425
      Committed by ZhenwenXu
      Fix a little of the coding style in mm/mmap.c
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: ZhenwenXu <helight.xu@gmail.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48aae425
    • shmem: unify regular and tiny shmem · 853ac43a
      Committed by Matt Mackall
      tiny-shmem shares most of its 130 lines of code with shmem and tends to
      break when particular bits of shmem get modified.  Unifying saves code and
      makes keeping these two in sync much easier.
      
      before:
        14367	    392	     24	  14783	   39bf	mm/shmem.o
          396      72       8     476	    1dc	mm/tiny-shmem.o
      
      after:
        14367	    392	     24	  14783	   39bf	mm/shmem.o
          412	     72       8     492	    1ec	mm/shmem.o tiny
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      853ac43a
    • mm: make get_user_pages() interruptible · 4779280d
      Committed by Ying Han
      The initial implementation of checking TIF_MEMDIE covers the OOM-killing
      case: if the process has been OOM killed, TIF_MEMDIE is set and
      get_user_pages() returns immediately.  This patch adds two things (see
      the sketch after this list):

      1.  Handle the case where the SIGKILL is sent by a user process.  A
         process can try to get_user_pages() an unlimited amount of memory
         even after a user process has sent it a SIGKILL (for example, a
         monitor notices that the process exceeds its memory limit and tries
         to kill it).  In the old implementation, the SIGKILL is not handled
         until get_user_pages() returns.

      2.  Change the return value to ERESTARTSYS.  It makes no sense to
         return ENOMEM when get_user_pages() returns because a SIGKILL was
         received.  The general convention for a system call interrupted by a
         signal is ERESTARTSYS, so the new return value is consistent with
         that.
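      A sketch of the check inside the get_user_pages() loop (simplified;
      fatal_signal_pending() is a real helper of that era, while the
      ignore_sigkill flag corresponds to the GUP flag discussed below and is
      shown here only as an assumption):

      /* inside the per-page loop of get_user_pages() */
      if (!ignore_sigkill && fatal_signal_pending(current)) {
              /*
               * A SIGKILL is pending (OOM kill or an ordinary kill from
               * user space): stop pinning pages and let the signal be
               * delivered instead of returning ENOMEM.
               */
              return i ? i : -ERESTARTSYS;
      }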
      
      Lee:
      
      An unfortunate side effect of "make-get_user_pages-interruptible" is that
      it prevents a SIGKILL'd task from munlocking the pages it had mlocked,
      resulting in mlocked pages being freed.  Freeing mlocked pages is, in
      itself, not so bad.  We just count them now--altho' I had hoped to
      remove this stat and add PG_MLOCKED to the free pages flags check.
      
      However, consider pages in shared libraries mapped by more than one task
      that a task mlocked--e.g., via mlockall().  If the task that mlocked the
      pages exits via SIGKILL, these pages would be left mlocked and
      unevictable.
      
      Proposed fix:
      
      Add another GUP flag to ignore sigkill when calling get_user_pages from
      munlock()--similar to Kosaki Motohiro's IGNORE_VMA_PERMISSIONS flag for
      the same purpose.  We are not actually allocating memory in this case,
      which "make-get_user_pages-interruptible" intends to avoid.  We're just
      munlocking pages that are already resident and mapped, and we're reusing
      get_user_pages() to access those pages.

      ??  Maybe we should combine IGNORE_VMA_PERMISSIONS and _IGNORE_SIGKILL
      into a single flag: GUP_FLAGS_MUNLOCK ???
      
      [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock]
      Signed-off-by: Paul Menage <menage@google.com>
      Signed-off-by: Ying Han <yinghan@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4779280d
    • vmscan: shrink_active_list(): reduce lru_lock hold time · b555749a
      Committed by Andrew Morton
      These three statements manipulate local variables and do not need the lock
      coverage.
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b555749a
    • badpage: KERN_ALERT BUG instead of KERN_EMERG · 1e9e6365
      Committed by Hugh Dickins
      bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
      which I've followed in print_bad_pte().  These are serious system errors,
      on a par with BUGs, but they're not quite emergencies, and we do our best
      to carry on: say KERN_ALERT "BUG: " like the x86 oops does.
      
      And remove the "Trying to fix it up, but a reboot is needed" line: it's
      not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e9e6365
    • badpage: ratelimit print_bad_pte and bad_page · d936cf9b
      Committed by Hugh Dickins
      print_bad_pte() and bad_page() might each need ratelimiting - especially
      for their dump_stacks, almost never of interest, yet not quite
      dispensable.  Correlating corruption across neighbouring entries can be
      very helpful, so allow a burst of 60 reports before keeping quiet for the
      remainder of that minute (or allow a steady drip of one report per
      second).  A sketch of that policy follows.
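      A sketch of the rate limiting described above, expressed with the
      kernel's generic ratelimit helpers (the actual patch rolls its own
      small limiter; report_bad_page() is an illustrative wrapper name):

      static DEFINE_RATELIMIT_STATE(bad_page_rs, 60 * HZ, 60);  /* 60 reports per minute */

      static void report_bad_page(struct page *page)
      {
              if (!__ratelimit(&bad_page_rs))
                      return;          /* keep quiet for the rest of the minute */

              /* ... print the "Bad page" diagnostics and dump_stack() ... */
      }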
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d936cf9b
    • badpage: remove vma from page_remove_rmap · edc315fd
      Committed by Hugh Dickins
      Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
      And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
      page_dup_rmap(): we're trying to be more resilient about that than BUGs.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edc315fd
    • badpage: zap print_bad_pte on swap and file · 2509ef26
      Committed by Hugh Dickins
      Complete zap_pte_range()'s coverage of bad pagetable entries by calling
      print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
      That needs free_swap_and_cache() to tell it, which will also have shown
      one of those "swap_free" errors (but with much less information).
      
      Similar checks in fork's copy_one_pte()?  No, that would be more noisy
      than helpful: we'll see them when parent and child exec or exit.
      
      Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
      case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
      VM_FAULT_OOM rather than VM_FAULT_SIGBUS?  Well, okay, that is consistent
      with what happens if do_swap_page() operates a bad swap entry; but don't
      we have patches to be more careful about killing when VM_FAULT_OOM?
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2509ef26
    • badpage: vm_normal_page use print_bad_pte · 22b31eec
      Committed by Hugh Dickins
      print_bad_pte() is so far being called only when zap_pte_range() finds
      negative page_mapcount, or there's a fault on a pte_file where it does not
      belong.  That's weak coverage when we suspect pagetable corruption.
      
      Originally, it was called when vm_normal_page() found an invalid pfn: but
      pfn_valid is expensive on some architectures and configurations, so 2.6.24
      put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
      2.6.26 replaced it by a VM_BUG_ON (likewise).
      
      Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test
      than pfn_valid(): memmap_init_zone() (used at bootup and for hotplug)
      keeps a __read_mostly note of the highest_memmap_pfn, and vm_normal_page()
      then checks the pfn against that (see the sketch below).  We could call
      this pfn_plausible() or pfn_sane(), but I doubt we'll need it elsewhere:
      of course it's not reliable, but it gives much stronger pagetable
      validation on many boxes.
      
      Also use print_bad_pte() when the pte_special bit is found outside a
      VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.
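      A sketch of the cheaper test (highest_memmap_pfn is the __read_mostly
      value noted by memmap_init_zone(); the surrounding vm_normal_page()
      logic and the exact print_bad_pte() arguments are abridged):

      /* in vm_normal_page(), after the pte has been translated to a pfn */
      unsigned long pfn = pte_pfn(pte);

      if (unlikely(pfn > highest_memmap_pfn)) {
              /* cheaper than pfn_valid(), and enabled in production kernels */
              print_bad_pte(vma, addr, pte, NULL);
              return NULL;
      }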
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22b31eec
    • badpage: replace page_remove_rmap Eeek and BUG · 3dc14741
      Committed by Hugh Dickins
      Now that bad pages are kept out of circulation, there is no need for the
      infamous page_remove_rmap() BUG() - once that page is freed, its negative
      mapcount will issue a "Bad page state" message and the page won't be
      freed.  Removing the BUG() allows more info, on subsequent pages, to be
      gathered.
      
      We do have more info about the page at this point than bad_page() can know
      - notably, what the pmd is, which might pinpoint something like low 64kB
      corruption - but page_remove_rmap() isn't given the address to find that.
      
      In practice, there is only one call to page_remove_rmap() which has ever
      reported anything, that from zap_pte_range() (usually on exit, sometimes
      on munmap).  It has all the info, so remove page_remove_rmap()'s "Eeek"
      message and leave it all to zap_pte_range().
      
      mm/memory.c already has a hardly used print_bad_pte() function, showing
      some of the appropriate info: extend it to show what we want for the rmap
      case: pte info, page info (when there is a page) and vma info to compare.
      zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
      use if it works that out for itself.
      
      Some of this info is also shown in bad_page()'s "Bad page state" message.
      Keep them separate, but adjust them to match each other as far as
      possible.  Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
      there too.
      
      print_bad_pte() shows current->comm unconditionally (though it then gets
      repeated in the usually irrelevant stack trace): sorry, I misled Nick
      Piggin into making it conditional on vm_mm == current->mm, but current->mm
      is already NULL in the exit case.  Usually current->comm is good, though
      exceptionally it may not be that of the mm (when "swapoff", for example).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dc14741
    • badpage: keep any bad page out of circulation · 8cc3b392
      Committed by Hugh Dickins
      Until now the bad_page() checkers have special-cased PageReserved, keeping
      those pages out of circulation thereafter.  Now extend the special case to
      all: we want to keep ANY page with bad state out of circulation - the
      "free" page may well be in use by something.
      
      Leave the bad state of those pages untouched, for examination by
      debuggers; except for PageBuddy - leaving that set would risk bringing the
      page back.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cc3b392
    • badpage: simplify page_alloc flag check+clear · 79f4b7bf
      Committed by Hugh Dickins
      Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
      a page: check the same flags as before when freeing, clear ALL the flags
      (unless PageReserved) when freeing, check ALL flags off when allocating.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79f4b7bf
    • mm: kill zone_is_near_oom() · 09f445e7
      Committed by KOSAKI Motohiro
      zone_is_near_oom() is unused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09f445e7
    • vmscan: improve reclaim throughput to bail out patch · 01dbe5c9
      Committed by KOSAKI Motohiro
      The vmscan bail-out patch moved the nr_reclaimed variable into struct
      scan_control.  Unfortunately, that indirect access easily causes cache
      misses.

      Under heavy memory pressure that is fine: cache misses are already
      plentiful, so the cost is not observable.  But when memory pressure is
      light, the performance regression is observable.

      I compared the following three patterns (each measured 10 times):
      hackbench 125 process 3000
      hackbench 130 process 3000
      hackbench 135 process 3000
      
                  2.6.28-rc6                       bail-out
      
      	125	130	135		125	130	135
            ==============================================================
      	71.866	75.86	81.274		93.414	73.254	193.382
      	74.145	78.295	77.27		74.897	75.021	80.17
      	70.305	77.643	75.855		70.134	77.571	79.896
      	74.288	73.986	75.955		77.222	78.48	80.619
      	72.029	79.947	78.312		75.128	82.172	79.708
      	71.499	77.615	77.042		74.177	76.532	77.306
      	76.188	74.471	83.562		73.839	72.43	79.833
      	73.236	75.606	78.743		76.001	76.557	82.726
      	69.427	77.271	76.691		76.236	79.371	103.189
      	72.473	76.978	80.643		69.128	78.932	75.736
      
      avg	72.545	76.767	78.534		76.017	77.03	93.256
      std	1.89	1.71	2.41		6.29	2.79	34.16
      min	69.427	73.986	75.855		69.128	72.43	75.736
      max	76.188	79.947	83.562		93.414	82.172	193.382
      
      This is roughly a 4-5% regression.
      
      Then, this patch introduces a temporary local variable.
      
      result:
      
                  2.6.28-rc6                       this patch
      
      num	125	130	135		125	130	135
            ==============================================================
      	71.866	75.86	81.274		67.302	68.269	77.161
      	74.145	78.295	77.27   	72.616	72.712	79.06
      	70.305	77.643	75.855  	72.475	75.712	77.735
      	74.288	73.986	75.955  	69.229	73.062	78.814
      	72.029	79.947	78.312  	71.551	74.392	78.564
      	71.499	77.615	77.042  	69.227	74.31	78.837
      	76.188	74.471	83.562  	70.759	75.256	76.6
      	73.236	75.606	78.743  	69.966	76.001	78.464
      	69.427	77.271	76.691  	69.068	75.218	80.321
      	72.473	76.978	80.643  	72.057	77.151	79.068
      
      avg	72.545	76.767	78.534 		70.425	74.2083	78.462
      std 	1.89	1.71	2.41    	1.66	2.34	1.00
      min 	69.427	73.986	75.855  	67.302	68.269	76.6
      max 	76.188	79.947	83.562  	72.616	77.151	80.321
      
      OK, the regression has disappeared.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01dbe5c9
    • vmscan: bail out of direct reclaim after swap_cluster_max pages · a79311c1
      Committed by Rik van Riel
      When the VM is under pressure, it can happen that several direct reclaim
      processes are in the pageout code simultaneously.  It also happens that
      the reclaiming processes run into mostly referenced, mapped and dirty
      pages in the first round.
      
      This results in multiple direct reclaim processes having a lower
      pageout priority, which corresponds to a higher target of pages to
      scan.
      
      This in turn can result in each direct reclaim process freeing
      many pages.  Together, they can end up freeing way too many pages.
      
      This kicks useful data out of memory (in some cases more than half
      of all memory is swapped out).  It also impacts performance by
      keeping tasks stuck in the pageout code for too long.
      
      A 30% improvement in hackbench has been observed with this patch.
      
      The fix is relatively simple: in shrink_zone() we can check how many
      pages we have already freed; direct reclaim tasks break out of the
      scanning loop once they have freed enough pages and have reached a
      lower priority level (see the sketch below).

      We do not break out of shrink_zone() when priority == DEF_PRIORITY,
      to ensure that equal pressure is applied to every zone in the common
      case.

      However, in order to do this we need to know how many pages we have
      already freed, so move nr_reclaimed into scan_control.
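      A sketch of the bail-out condition inside shrink_zone()'s scanning loop
      (simplified; sc is the struct scan_control mentioned above and the field
      names are approximations of that era's code):

      /* inside shrink_zone()'s loop over the LRU lists */
      if (sc->nr_reclaimed > sc->swap_cluster_max &&
          priority < DEF_PRIORITY && !current_is_kswapd())
              break;    /* this direct reclaimer has freed enough; stop scanning */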
      
      akpm: a historical interlude...
      
      We tried this in 2004:
      
      :commit e468e46a9bea3297011d5918663ce6d19094cf87
      :Author: akpm <akpm>
      :Date:   Thu Jun 24 15:53:52 2004 +0000
      :
      :[PATCH] vmscan.c: dont reclaim too many pages
      :
      :    The shrink_zone() logic can, under some circumstances, cause far too many
      :    pages to be reclaimed.  Say, we're scanning at high priority and suddenly hit
      :    a large number of reclaimable pages on the LRU.
      :    Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
      
      And we reverted it in 2006:
      
      :commit 210fe530
      :Author: Andrew Morton <akpm@osdl.org>
      :Date:   Fri Jan 6 00:11:14 2006 -0800
      :
      :    [PATCH] vmscan: balancing fix
      :
      :    Revert a patch which went into 2.6.8-rc1.  The changelog for that patch was:
      :
      :      The shrink_zone() logic can, under some circumstances, cause far too many
      :      pages to be reclaimed.  Say, we're scanning at high priority and suddenly
      :      hit a large number of reclaimable pages on the LRU.
      :
      :      Change things so we bale out when SWAP_CLUSTER_MAX pages have been
      :      reclaimed.
      :
      :    Problem is, this change caused significant imbalance in inter-zone scan
      :    balancing by truncating scans of larger zones.
      :
      :    Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL.  The zone
      :    balancing algorithm would require that if we're scanning 100 pages of
      :    ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL.  But this logic will
      :    cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
      :    reclaimed.  Thus effectively causing smaller zones to be scanned relatively
      :    harder than large ones.
      :
      :    Now I need to remember what the workload was which caused me to write this
      :    patch originally, then fix it up in a different way...
      
      And we haven't demonstrated that whatever problem caused that reversion is
      not being reintroduced by this change in 2008.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a79311c1
    • hugetlb: fix sparse warnings · ebdd4aea
      Committed by Hannes Eder
      Fix the following sparse warnings:
      
        mm/hugetlb.c:375:3: warning: returning void-valued expression
        mm/hugetlb.c:408:3: warning: returning void-valued expression
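      The warning fires when a void function does `return f(x);` where f()
      also returns void; the fix is simply to separate the call and the
      return.  A generic before/after sketch (outer() and inner() are
      stand-ins, not the actual hugetlb functions):

      /* before: sparse warns "returning void-valued expression" */
      static void outer(void)
      {
              return inner();    /* inner() returns void, so this is flagged */
      }

      /* after: call and return separately */
      static void outer(void)
      {
              inner();
              return;            /* or simply fall off the end */
      }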
      Signed-off-by: Hannes Eder <hannes@hanneseder.net>
      Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebdd4aea
    • swapfile: let others seed random · f0d7a4b3
      Committed by Hugh Dickins
      Remove the srandom32((u32)get_seconds()) call from non-rotational swapon:
      there has been a coincidental discussion of seeding the generator earlier
      in boot; assuming that goes ahead, let swapon be a client of it rather
      than stirring the pool for itself.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0d7a4b3
    • swapfile: change discard pgoff_t to sector_t · 858a2990
      Committed by Hugh Dickins
      Change pgoff_t nr_blocks in discard_swap() and discard_swap_cluster() to
      sector_t: given the constraints on swap offsets (in particular, the 5 bits
      of swap type accommodated in the same unsigned long), pgoff_t was actually
      safe as is, but it certainly looked worrying when shifted left.
      
      [akpm@linux-foundation.org: fix shift overflow]
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      858a2990
    • swapfile: swap allocation cycle if nonrot · c60aa176
      Committed by Hugh Dickins
      Though attempting to find free clusters (Andrea), swap allocation has
      always restarted its searches from the beginning of the swap area (sct),
      to reduce seek times between swap pages, by not scattering them all over
      the partition.
      
      But on a solidstate swap device, seeks are cheap, and block remapping to
      level the wear may be limited by zones: in that case it's better to cycle
      around the whole partition.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c60aa176
    • swapfile: swapon randomize if nonrot · 20137a49
      Committed by Hugh Dickins
      Swap allocation has always started from the beginning of the swap area;
      but if we're dealing with a solidstate swap device which can only remap
      blocks within limited zones, that would sooner wear out the first zone.
      
      Therefore sys_swapon() tests whether the blk_queue is non-rotational and,
      if so, randomizes the cluster_next starting position for allocation (see
      the sketch below).
      
      If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it
      with an "SS" at the right end of the kernel's "Adding ...  swap" message
      (so that if it's both nonrot and discardable, "SSD" will be shown there).
      Perhaps something should be shown in /proc/swaps (swapon -s), but we have
      to be more cautious before making any addition to that format.
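      A sketch of the swapon-side check (simplified and abridged; the field
      names on the swap_info structure are approximations of that era's code,
      while blk_queue_nonrot() and random32() are real helpers of the period):

      /* in sys_swapon(), once the backing block device is known */
      if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
              p->flags |= SWP_SOLIDSTATE;
              /* start allocation at a random cluster so one zone is not worn out first */
              p->cluster_next = 1 + (random32() % p->highest_bit);
      }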
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20137a49