1. 17 Oct 2013 (1 commit)
    • swap: fix set_blocksize race during swapon/swapoff · 5b808a23
      Krzysztof Kozlowski authored
      Fix a race between swapoff and swapon.  Swapoff used old_block_size from
      swap_info outside of swapon_mutex, so it could be overwritten by a
      concurrent swapon.
      
      The race has a visible effect only if more than one swap block device
      exists with different block sizes (e.g. /dev/sda1 with block size 4096
      and /dev/sdb1 with 512).  In such a case it leads to setting the
      blocksize of the swapped-off device to the wrong value.
      
      The bug can be triggered with multiple concurrent swapoff and swapon:
      0. Swap for some device is on.
      1. swapoff:
      First swapoff is called on this device and "struct swap_info_struct
      *p" is assigned.  This is done under swap_lock; however, the lock is
      released for the call to try_to_unuse().
      
      2. swapon:
      After the assignment above (and before swapoff acquires swapon_mutex &
      swap_lock), swapon is called on the same device.
      p->old_block_size is assigned the block size of the device.  This block
      size should be the same as before, but sometimes it is not.  The swapon
      ends successfully.
      
      3. swapoff:
      Swapoff resumes, grabs the locks and mutex, and continues to disable the
      swap device.  Now it sets the block size to the value taken from
      swap_info, which was overwritten by swapon in step 2.
      Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
      Reported-by: Weijie Yang <weijie.yang.kh@gmail.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b808a23
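      To make the race above concrete, here is a minimal user-space sketch of the
      fixed ordering, with a pthread mutex standing in for swapon_mutex.  All type
      and function names (swap_dev, swapoff_fixed) are illustrative, not the
      kernel's; this only models the idea of snapshotting old_block_size while the
      mutex is still held.

        #include <pthread.h>
        #include <stdio.h>

        /* Simplified model: old_block_size is the field a concurrent swapon
         * may overwrite while swapoff is busy in try_to_unuse(). */
        struct swap_dev {
            unsigned long old_block_size;
        };

        static pthread_mutex_t swapon_mutex = PTHREAD_MUTEX_INITIALIZER;
        static struct swap_dev dev = { .old_block_size = 4096 };

        static void set_blocksize(unsigned long size)
        {
            printf("set_blocksize(%lu)\n", size);
        }

        /* The buggy code read dev.old_block_size only after dropping the
         * mutex, so a concurrent swapon could overwrite it first.  The fix,
         * as described above, is to take a private copy under the mutex. */
        static void swapoff_fixed(void)
        {
            unsigned long old_block_size;

            pthread_mutex_lock(&swapon_mutex);
            old_block_size = dev.old_block_size;   /* snapshot under the lock */
            /* ... disable the swap device ... */
            pthread_mutex_unlock(&swapon_mutex);

            set_blocksize(old_block_size);         /* immune to concurrent swapon */
        }

        int main(void)
        {
            swapoff_fixed();
            return 0;
        }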
  2. 12 Sep 2013 (6 commits)
    • swap: make cluster allocation per-cpu · ebc2a1a6
      Shaohua Li authored
      Swap cluster allocation exists to get better request merging and thus
      better performance.  But the cluster is shared globally: if multiple
      tasks are doing swap, this causes interleaved disk access.  Multi-task
      swapping is quite common; for example, each NUMA node has a kswapd
      thread doing swap, and multiple threads/processes do direct page reclaim.
      
      The I/O scheduler can't help much here, because tasks don't send swapout
      IO down to the block layer at the same time.  The block layer does merge
      some IOs, but many are not merged, depending on how many tasks are doing
      swapout concurrently.  In practice, I've seen a lot of small-size IO in
      swapout workloads.
      
      We make the cluster allocation per-cpu here.  The interleaved disk access
      issue goes away.  All tasks swap out to their own cluster, so swapout
      becomes sequential and can easily be merged into big-size IO.  If one CPU
      can't get its per-cpu cluster (for example, there is no free cluster
      left in the swap device), it falls back to scanning swap_map.  The CPU
      can still continue to swap.  We don't need to recycle free swap entries
      of other CPUs.
      
      In my test (swap to a 2-disk raid0 partition), this improves swapout
      throughput by around 10%, and the request size is increased significantly.
      
      How this impacts swap readahead is uncertain, though.  On one side, page
      reclaim always isolates and swaps several adjacent pages, which makes
      page reclaim write the pages sequentially and benefits readahead.  On
      the other side, several CPUs writing pages in an interleaved fashion
      means the pages don't live _sequentially_ but only relatively _near_
      each other.  In the per-cpu allocation case, if adjacent pages are
      written by different CPUs, they will live relatively _far_ apart.  So
      how this impacts swap readahead depends on how many pages page reclaim
      isolates and swaps at one time.  If the number is big, this patch will
      benefit swap readahead.  Of course, this is about the sequential access
      pattern.  The patch has no impact on the random access pattern, because
      the new cluster allocation algorithm is only for SSD.
      
      An alternative solution is organizing the swap layout per-mm instead of
      per-cpu.  In the per-mm layout, we allocate a disk range for each mm, so
      pages of one mm live adjacently on the swap disk.  The per-mm layout has
      potential lock contention issues if multiple reclaimers are swapping
      pages from one mm.  For a sequential workload, the per-mm layout is
      better for implementing swap readahead, because pages from the mm are
      adjacent on disk.  But the per-cpu layout isn't very bad in this workload
      either, as page reclaim always isolates and swaps several pages at one
      time, so such pages still live sequentially on disk and readahead can
      utilize this.  For a random workload, the per-mm layout isn't beneficial
      for request merging, because it's quite possible that pages from
      different mms are swapped out at the same time and the IO can't be
      merged in the per-mm layout, while with the per-cpu layout we can merge
      requests from any mm.  Considering that the random workload is more
      common in workloads with swap (and the per-cpu approach isn't too bad
      for the sequential workload either), I'm choosing the per-cpu layout.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebc2a1a6
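      The per-cpu idea can be sketched in user space with a per-thread "current
      cluster" that is consumed sequentially and only refilled from a shared pool.
      This is a simplified model; the names (cpu_cluster, alloc_swap_slot) and the
      trivial free-cluster pool are illustrative, not the kernel's, and a real
      implementation would protect the shared pool with a lock.

        #include <stdio.h>

        #define CLUSTER_PAGES 256
        #define NR_CLUSTERS   64

        struct cpu_cluster {
            int cluster;   /* index of this CPU's current cluster, -1 if none */
            int next;      /* next free page slot inside that cluster */
        };

        static unsigned char swap_map[NR_CLUSTERS * CLUSTER_PAGES];  /* 0 == free */
        static int next_free_cluster;                 /* stand-in for a free list */
        static __thread struct cpu_cluster percpu = { -1, 0 };

        /* Each CPU (thread) consumes its own cluster, so its writes stay
         * sequential on disk; only when the cluster is exhausted does it go
         * back to the shared pool.  If no cluster is left, the real code
         * falls back to scanning swap_map, which is omitted here. */
        static long alloc_swap_slot(void)
        {
            if (percpu.cluster < 0 || percpu.next >= CLUSTER_PAGES) {
                if (next_free_cluster >= NR_CLUSTERS)
                    return -1;                        /* would fall back to scan */
                percpu.cluster = next_free_cluster++;
                percpu.next = 0;
            }
            long slot = (long)percpu.cluster * CLUSTER_PAGES + percpu.next++;
            swap_map[slot] = 1;
            return slot;
        }

        int main(void)
        {
            for (int i = 0; i < 4; i++)
                printf("slot %ld\n", alloc_swap_slot());
            return 0;
        }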
    • swap: fix races exposed by swap discard · edfe23da
      Shaohua Li authored
      The previous patch can expose races, according to Hugh:
      
      swapoff was sometimes failing with "Cannot allocate memory", coming from
      try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing
      on a free entry temporarily SWAP_MAP_BAD while being discarded.
      
      We should use ACCESS_ONCE() there, and whenever accessing swap_map
      locklessly; but rather than peppering it throughout try_to_unuse(), just
      declare *swap_map with volatile.
      
      try_to_unuse() is accustomed to *swap_map going down racily, but not
      necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to
      prevent that transition once SWP_WRITEOK is switched off, when it's a
      waste of time to issue discards anyway (swapon can do a whole discard).
      
      Another issue is:
      
      In swapin_readahead(), read_swap_cache_async() can read a bad swap
      entry, because we don't check whether the readahead swap entry is bad.
      This doesn't break anything, but such a swapped-in page is wasteful and
      can only be freed at page reclaim.  We should avoid reading such swap
      entries.  And during discard, we mark the swap entry SWAP_MAP_BAD and
      then switch it back to normal when the discard is finished.  If
      readahead reads such a swap entry, we have the same issue, so we must
      check whether the swap entry is bad there too.
      
      Thanks to Hugh for pointing out that swapin_readahead() could use a bad
      swap entry.
      
      [include Hugh's patch 'swap: fix swapoff ENOMEMs from discard']
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edfe23da
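      The commit relies on ACCESS_ONCE()-style lockless reads of *swap_map.  The
      snippet below is a user-space stand-in showing what the volatile qualifier
      buys: every dereference becomes a fresh load that the compiler cannot cache
      across the lockless checks.  The array and names are illustrative only.

        #include <stdio.h>

        /* User-space stand-in for the kernel's ACCESS_ONCE(): force one load
         * per use instead of letting the compiler keep the value in a register. */
        #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

        static unsigned char swap_map_storage[16];

        /* Declaring the pointed-to data volatile, as the commit does for
         * *swap_map in try_to_unuse(), makes every dereference a fresh read. */
        static volatile unsigned char *swap_map = swap_map_storage;

        int main(void)
        {
            unsigned int i = 3;

            /* Both forms re-read memory on every evaluation, which is the
             * property the lockless checks depend on. */
            unsigned char a = swap_map[i];
            unsigned char b = ACCESS_ONCE(swap_map_storage[i]);

            printf("%u %u\n", a, b);
            return 0;
        }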
    • swap: make swap discard async · 815c2c54
      Shaohua Li authored
      swap can do cluster discard for SSD, which is good, but there are some
      problems here:
      
      1. swap does the discard just before page reclaim gets a swap entry and
         writes the disk sectors.  This is useless for a high-end SSD, because
         an overwrite of a sector implies a discard of the original sector
         too.  A discard + overwrite == overwrite.
      
      2. the purpose of doing discard is to improve SSD firmware garbage
         collection.  Ideally we should send the discard as early as possible,
         so the firmware can do something smart.  Sending the discard just
         after the swap entry is freed is considered early compared to sending
         the discard before the write.  Of course, if the workload is already
         bound by gc speed, sending the discard earlier or later doesn't make
         a difference.
      
      3. block discard is a sync API, which will delay scan_swap_map()
         significantly.
      
      4. Write and discard commands can be executed in parallel in a PCIe SSD.
         Making swap discard async can make execution more efficient.
      
      This patch makes swap discard async and moves the discard to where the
      swap entry is freed.  Discard and write have no dependency now, so the
      above issues can be avoided.  Ideally we should do discard for any freed
      sectors, but some SSDs are very slow at discard.  This patch still does
      discard for a whole cluster.
      
      My test does several rounds of 'mmap, write, unmap', which triggers a
      lot of swap discard.  On a fusionio card, with this patch, the test
      runtime is reduced to 18% of the time without it, so around 5.5x faster.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      815c2c54
    • swap: change block allocation algorithm for SSD · 2a8f9449
      Shaohua Li authored
      I'm using a fast SSD to do swap.  scan_swap_map() sometimes uses up to
      20~30% CPU time (when cluster is hard to find, the CPU time can be up to
      80%), which becomes a bottleneck.  scan_swap_map() scans a byte array to
      search a 256 page cluster, which is very slow.
      
      Here I introduce a simple algorithm to search for a cluster.  Since we
      only care about 256-page clusters, we can just use a counter to track
      whether a cluster is free.  Every 256 pages use one int to store the
      counter.  If the counter of a cluster is 0, the cluster is free.  All
      free clusters are added to a list, so searching for a cluster is very
      efficient.  With this, the scan_swap_map() overhead disappears.
      
      This might help low end SD card swap too.  Because if the cluster is
      aligned, SD firmware can do flash erase more efficiently.
      
      We only enable the algorithm for SSD.  Hard disk swap isn't fast enough
      and has downside with the algorithm which might introduce regression (see
      below).
      
      The patch slightly changes which cluster is chosen.  It always adds a
      free cluster to the list tail, which can help wear leveling for low-end
      SSDs too.  And if no cluster is found, scan_swap_map() will search from
      the end of the last free cluster, which is effectively random.  For SSD,
      this isn't a problem at all.
      
      Another downside is that a cluster must be aligned to 256 pages, which
      reduces the chance to find a cluster.  I would expect this isn't a big
      problem for SSD because of the lack of a seek penalty.  (And this is the
      reason I only enable the algorithm for SSD.)
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a8f9449
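      The counter-per-cluster idea above is easy to model: one usage counter per
      256-page cluster plus a FIFO of free cluster ids, so finding a free cluster
      is O(1) instead of scanning the swap_map byte array.  A minimal user-space
      sketch follows; all names (cluster_count, free_list, swap_page_free) are
      illustrative, not the kernel's.

        #include <stdio.h>

        #define CLUSTER_PAGES 256
        #define NR_CLUSTERS   128

        static int cluster_count[NR_CLUSTERS];   /* pages in use per cluster */
        static int free_list[NR_CLUSTERS];       /* FIFO of free cluster ids */
        static int free_head, free_tail, free_nr;

        static void cluster_list_add_tail(int c)  /* tail insert helps wear leveling */
        {
            free_list[free_tail] = c;
            free_tail = (free_tail + 1) % NR_CLUSTERS;
            free_nr++;
        }

        static int cluster_list_del_first(void)
        {
            int c = free_list[free_head];
            free_head = (free_head + 1) % NR_CLUSTERS;
            free_nr--;
            return c;
        }

        static void swap_page_alloc(int cluster)  /* one page taken from cluster */
        {
            cluster_count[cluster]++;
        }

        static void swap_page_free(int cluster)   /* one page returned */
        {
            if (--cluster_count[cluster] == 0)    /* count 0 => cluster is free */
                cluster_list_add_tail(cluster);
        }

        int main(void)
        {
            for (int c = 0; c < NR_CLUSTERS; c++)
                cluster_list_add_tail(c);         /* everything free at swapon */

            int c = cluster_list_del_first();
            swap_page_alloc(c);
            swap_page_free(c);
            printf("free clusters: %d\n", free_nr);
            return 0;
        }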
    • mm/swapfile.c: convert to pr_foo() · 465c47fd
      Andrew Morton authored
      A few 80-col gymnastics were cleaned up as a result.
      
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      465c47fd
    • swap: warn when a swap area overflows the maximum size · d6bbbd29
      Raymond Jennings authored
      It is possible to swapon a swap area that is too big for the pte width
      to handle.
      
      Presently this failure happens silently.
      
      Instead, emit a diagnostic to warn the user.
      
      Testing results, root prompt commands and kernel log messages:
      
      # lvresize /dev/system/swap --size 16G
      # mkswap /dev/system/swap
      # swapon /dev/system/swap
      
      Jul  7 04:27:22 warfang kernel: Adding 16777212k swap
      on /dev/mapper/system-swap.  Priority:-1 extents:1 across:16777212k
      
      # lvresize /dev/system/swap --size 64G
      # mkswap /dev/system/swap
      # swapon /dev/system/swap
      
      Jul  7 04:27:22 warfang kernel: Truncating oversized swap area, only
      using 33554432k out of 67108860k
      Jul  7 04:27:22 warfang kernel: Adding 33554428k swap
      on /dev/mapper/system-swap.  Priority:-1 extents:1 across:33554428k
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Raymond Jennings <shentino@gmail.com>
      Acked-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6bbbd29
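      The truncation in the log above follows from how many swap-offset bits the
      pte encoding can hold.  The sketch below is purely illustrative arithmetic:
      the 23-bit figure is an assumption chosen because, with 4 KiB pages, it
      reproduces the 33554432k value in the kernel log; the real limit depends on
      the architecture's swap pte encoding.

        #include <stdio.h>

        int main(void)
        {
            unsigned int offset_bits = 23;        /* assumed for illustration */
            unsigned long long page_kib = 4;      /* 4 KiB pages */

            unsigned long long max_pages = 1ULL << offset_bits;
            unsigned long long max_kib = max_pages * page_kib;

            /* prints: max swap area: 8388608 pages = 33554432k */
            printf("max swap area: %llu pages = %lluk\n", max_pages, max_kib);
            return 0;
        }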
  3. 14 Aug 2013 (1 commit)
  4. 04 Jul 2013 (1 commit)
    • swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES · dcf6b7dd
      Rafael Aquini authored
      Considering the use cases where the swap device supports discard:
      a) and can do it quickly;
      b) but it's slow to do in small granularities (or concurrent with other
         I/O);
      c) but the implementation is so horrendous that you don't even want to
         send one down;
      
      And assuming that the sysadmin considers it useful to send the discards down
      at all, we would (probably) want the following solutions:
      
        i. do the fine-grained discards for freed swap pages, if device is
           capable of doing so optimally;
       ii. do single-time (batched) swap area discards, either at swapon
           or via something like fstrim (not implemented yet);
      iii. allow doing both single-time and fine-grained discards; or
       iv. turn it off completely (default behavior)
      
      As implemented today, one can only enable or disable discards for swap
      as a whole; one cannot select, for instance, solution (ii) on a swap
      device like (b), even though the single-time discard is regarded as
      interesting, or necessary to the workload, because enabling discard
      would imply (i) as well, and the device is not capable of performing
      that optimally.
      
      This patch addresses the scenario depicted above by introducing a way to
      ensure the (probably) wanted solutions (i, ii, iii and iv) can be
      flexibly flagged through swapon(8), allowing a sysadmin to select the
      most suitable swap discard policy according to system constraints.
      
      This patch introduces two new flags, SWAP_FLAG_DISCARD_PAGES and
      SWAP_FLAG_DISCARD_ONCE, to allow more flexible swap discard policies to
      be flagged through swapon(8).  The default behavior is to keep both
      single-time, or batched, area discards (SWAP_FLAG_DISCARD_ONCE) and
      fine-grained discards for page-clusters (SWAP_FLAG_DISCARD_PAGES)
      enabled, in order to keep consistency with older kernel behavior as well
      as maintain compatibility with older swapon(8).  However, through the
      newly introduced flags, the most suitable discard policy can be selected
      according to any given swap device constraint.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Karel Zak <kzak@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcf6b7dd
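      A hedged sketch of how the new flags might be combined from user space, here
      selecting policy (ii): one batched discard at swapon time only.  The numeric
      flag values are assumed to mirror the kernel's uapi definitions and are
      defined locally because the libc headers of that era may not export them;
      the raw syscall is used for the same reason.

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/syscall.h>

        /* Assumed to mirror the kernel's <linux/swap.h> definitions. */
        #define SWAP_FLAG_DISCARD        0x10000  /* enable discard for swap */
        #define SWAP_FLAG_DISCARD_ONCE   0x20000  /* discard swap area at swapon time */
        #define SWAP_FLAG_DISCARD_PAGES  0x40000  /* discard freed page-clusters */

        int main(int argc, char **argv)
        {
            if (argc < 2) {
                fprintf(stderr, "usage: %s /dev/<swapdev>\n", argv[0]);
                return 1;
            }

            /* Policy (ii) from the commit message: a single batched discard
             * at swapon time, but no fine-grained discards afterwards. */
            int flags = SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE;

            if (syscall(SYS_swapon, argv[1], flags) != 0) {
                perror("swapon");
                return 1;
            }
            return 0;
        }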
  5. 13 Jun 2013 (1 commit)
    • frontswap: fix incorrect zeroing and allocation size for frontswap_map · 7b57976d
      Akinobu Mita authored
      A bitmap accessed by bitops must be large enough to hold the required
      number of bits rounded up to a multiple of BITS_PER_LONG.  And the
      bitmap must not be zeroed by memset() if the number of bits to clear is
      not a multiple of BITS_PER_LONG.
      
      This fixes incorrect zeroing and an incorrect allocation size for
      frontswap_map.  The incorrect zeroing part doesn't cause any problem,
      because frontswap_map is freed just after zeroing.  But the wrongly
      calculated allocation size may cause problems.
      
      For 32-bit systems, the allocation size of frontswap_map is about twice
      as large as the required size.  For 64-bit systems, the allocation size
      is smaller than required if the number of bits is not a multiple of
      BITS_PER_LONG.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b57976d
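      The arithmetic at issue is the usual round-up-to-whole-longs sizing.  A
      small user-space sketch of the correct computation follows (the kernel uses
      vzalloc for frontswap_map rather than calloc; the macro below matches the
      rounding the kernel's BITS_TO_LONGS() performs):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define BITS_PER_LONG (8 * sizeof(long))

        /* Enough longs to hold `bits` bits, rounded up. */
        #define BITS_TO_LONGS(bits) (((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)

        int main(void)
        {
            unsigned long maxpages = 1000003;   /* not a multiple of BITS_PER_LONG */

            /* Correct size for a one-bit-per-swap-page map. */
            size_t bytes = BITS_TO_LONGS(maxpages) * sizeof(long);

            unsigned long *frontswap_map = calloc(1, bytes);
            if (!frontswap_map)
                return 1;

            /* Clearing must also cover whole longs; memset over `bytes` is
             * safe precisely because the size was rounded up above. */
            memset(frontswap_map, 0, bytes);

            printf("%lu bits -> %zu bytes\n", maxpages, bytes);
            free(frontswap_map);
            return 0;
        }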
  6. 01 May 2013 (1 commit)
    • frontswap: get rid of swap_lock dependency · 4f89849d
      Minchan Kim authored
      The frontswap initialization routine depends on swap_lock, which wants
      to be atomic about frontswap's first appearance.  IOW, either frontswap
      is not present and will fail all calls, OR frontswap is fully
      functional; but if the new swap_info_struct isn't registered by
      enable_swap_info(), the swap subsystem doesn't start I/O, so there is no
      race between the init procedure and page I/O working on frontswap.
      
      So let's remove the unnecessary swap_lock dependency.
      
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      [v1: Rebased on my branch, reworked to work with backends loading late]
      [v2: Added a check for !map]
      [v3: Made the invalidate path follow the init path]
      [v4: Address comments by Wanpeng Li <liwanp@linux.vnet.ibm.com>]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad@darnok.org>
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andor Daam <andor.daam@googlemail.com>
      Cc: Florian Schmaus <fschmaus@gmail.com>
      Cc: Stefan Hengelein <ilendir@googlemail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f89849d
  7. 30 Apr 2013 (1 commit)
  8. 24 Feb 2013 (3 commits)
    • mm,ksm: swapoff might need to copy · 9e16b7fb
      Hugh Dickins authored
      Before establishing that KSM page migration was the cause of my
      WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
      lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in
      many respects is equivalent to faulting in a page.
      
      In fact I've never caught that as the cause: but in theory it does at
      least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to
      avoid bringing a KSM page back in when it's not supposed to be.
      
      I intended to copy how it's done in do_swap_page(), but have a strong
      aversion to how "swapcache" ends up being used there: rework it with
      "page != swapcache".
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e16b7fb
    • swap: add per-partition lock for swapfile · ec8acf20
      Shaohua Li authored
      swap_lock is heavily contended when I test swap to 3 fast SSDs (it is
      even slightly slower than swap to 2 such SSDs).  The main contention
      comes from swap_info_get().  This patch tries to close the gap by adding
      a new per-partition lock.
      
      Global data like nr_swapfiles, total_swap_pages, least_priority and
      swap_list are still protected by swap_lock.
      
      nr_swap_pages is an atomic now; it can be changed without swap_lock.  In
      theory, it's possible that get_swap_page() finds no swap pages even
      though free swap pages actually exist.  But that doesn't sound like a
      big problem.
      
      Accessing partition specific data (like scan_swap_map and so on) is only
      protected by swap_info_struct.lock.
      
      Changing swap_info_struct.flags requires holding both swap_lock and
      swap_info_struct.lock, because scan_swap_map() will check it.  Reading
      the flags is OK with either of the locks held.
      
      If both swap_lock and swap_info_struct.lock must be held, we always take
      the former first to avoid deadlock.
      
      swap_entry_free() can change swap_list.  To delete that code, we add a
      new highest_priority_index.  Whenever get_swap_page() is called, we
      check it.  If it's valid, we use it.
      
      It's a pity that get_swap_page() still holds swap_lock.  But in
      practice, swap_lock isn't heavily contended in my test with this patch
      (or I can say there are other much heavier bottlenecks, like TLB flush).
      And BTW, it looks like get_swap_page() doesn't really need the lock: we
      never free swap_info[] and we check the SWP_WRITEOK flag.  The only risk
      without the lock is that we could swap out to some low-priority swap
      device, but we can quickly recover after several rounds of swap, so that
      doesn't sound like a big deal to me.  But I'd prefer to fix this if it's
      a real problem.
      
      "swap: make each swap partition have one address_space" improved the
      swapout speed from 1.7G/s to 2G/s.  This patch further improves the
      speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
      so TLB flush isn't the biggest bottleneck before the patches.
      
      [arnd@arndb.de: fix it for nommu]
      [hughd@google.com: add missing unlock]
      [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec8acf20
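      The locking rules described above (global lock for global data, per-partition
      lock for partition data, global taken first when both are needed) can be
      modeled with pthreads.  This is a sketch of the ordering only; struct layout
      and function names are illustrative, not the kernel's.

        #include <pthread.h>

        static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;  /* global */

        struct swap_info {
            pthread_mutex_t lock;   /* per-partition lock */
            unsigned int flags;
        };

        /* Changing flags needs both locks; the global one is always taken
         * first so the lock order stays consistent and cannot deadlock. */
        static void change_flags(struct swap_info *si, unsigned int flags)
        {
            pthread_mutex_lock(&swap_lock);
            pthread_mutex_lock(&si->lock);
            si->flags = flags;
            pthread_mutex_unlock(&si->lock);
            pthread_mutex_unlock(&swap_lock);
        }

        /* Per-partition work (scan_swap_map()-style state) only needs the
         * partition's own lock, which is what removes the global contention. */
        static void touch_partition(struct swap_info *si)
        {
            pthread_mutex_lock(&si->lock);
            /* ... partition-local work ... */
            pthread_mutex_unlock(&si->lock);
        }

        int main(void)
        {
            static struct swap_info si = { PTHREAD_MUTEX_INITIALIZER, 0 };
            change_flags(&si, 1);
            touch_partition(&si);
            return 0;
        }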
    • swap: make each swap partition have one address_space · 33806f06
      Shaohua Li authored
      When I use several fast SSDs for swap, swapper_space.tree_lock is
      heavily contended.  This patch gives each swap partition its own
      address_space to reduce the lock contention.  There is an array of
      address_spaces for swap, and the swap entry type is the index into the
      array.
      
      In my test with 3 SSDs, this increases the swapout throughput by 20%.
      
      [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33806f06
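      The indexing scheme is simple: instead of one global swapper_space, keep an
      array and pick the element by the entry's swap "type".  A user-space model
      follows; the *_model types are stand-ins, and MAX_SWAPFILES is an assumed
      illustrative constant (the kernel derives its value from the pte encoding).

        #include <stdio.h>

        #define MAX_SWAPFILES 32    /* illustrative */

        struct address_space_model { int dummy; };

        static struct address_space_model swapper_spaces[MAX_SWAPFILES];

        struct swp_entry_model {
            unsigned int type;      /* which swap area */
            unsigned long offset;   /* page offset inside it */
        };

        /* One address_space per swap area, selected by the entry's type. */
        static struct address_space_model *swap_address_space(struct swp_entry_model e)
        {
            return &swapper_spaces[e.type];
        }

        int main(void)
        {
            struct swp_entry_model e = { .type = 2, .offset = 1234 };
            printf("entry uses address_space %td\n",
                   swap_address_space(e) - swapper_spaces);
            return 0;
        }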
  9. 23 Feb 2013 (1 commit)
  10. 12 Dec 2012 (4 commits)
    • mm, oom: fix race when specifying a thread as the oom origin · e1e12d2f
      David Rientjes authored
      test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
      specify that current should be killed first if an oom condition occurs in
      between the two calls.
      
      The usage is
      
      	short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
      	...
      	compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
      
      to store the thread's oom_score_adj, temporarily change it to the maximum
      score possible, and then restore the old value if it is still the same.
      
      This happens to still be racy, however, if the user writes
      OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
      The compare_swap_oom_score_adj() will then incorrectly reset the old value
      prior to the write of OOM_SCORE_ADJ_MAX.
      
      To fix this, introduce a new oom_flags_t member in struct signal_struct
      that will be used for per-thread oom killer flags.  KSM and swapoff can
      now use a bit in this member to specify that threads should be killed
      first in oom conditions without playing around with oom_score_adj.
      
      This also allows the correct oom_score_adj to always be shown when reading
      /proc/pid/oom_score.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1e12d2f
    • mm, oom: change type of oom_score_adj to short · a9c58b90
      David Rientjes authored
      The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
      so this range can be represented by the signed short type with no
      functional change.  The extra space this frees up in struct signal_struct
      will be used for per-thread oom kill flags in the next patch.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9c58b90
    • mm: do not call frontswap_init() during swapoff · 6555bc03
      Cesar Eduardo Barros authored
      The call to frontswap_init() was added within enable_swap_info(), which
      was called not only during sys_swapon, but also to reinsert the swap_info
      into the swap_list in case of failure of try_to_unuse() within
      sys_swapoff.  This means that frontswap_init() might be called more than
      once for the same swap area.
      
      While as far as I could see no frontswap implementation has any problem
      with it (and in fact, all the ones I found ignore the parameter passed to
      frontswap_init), this could change in the future.
      
      To prevent future problems, move the call to frontswap_init() to outside
      the code shared between sys_swapon and sys_swapoff.
      Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6555bc03
    • mm: refactor reinsert of swap_info in sys_swapoff() · cf0cac0a
      Cesar Eduardo Barros authored
      The block within sys_swapoff() which re-inserts the swap_info into the
      swap_list in case of failure of try_to_unuse() reads a few values outside
      the swap_lock.  While this is safe at that point, it is subtle code.
      
      Simplify the code by moving the reading of these values to a separate
      function, refactoring it a bit so they are read from within the swap_lock.
       This is easier to understand, and matches better the way it worked before
      I unified the insertion of the swap_info from both sys_swapon and
      sys_swapoff.
      
      This change should make no functional difference.  The only real change is
      moving the read of two or three structure fields to within the lock
      (frontswap_map_get() is nothing more than a read of p->frontswap_map).
      Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf0cac0a
  11. 17 Nov 2012 (1 commit)
  12. 13 Oct 2012 (2 commits)
    • vfs: make path_openat take a struct filename pointer · 669abf4e
      Jeff Layton authored
      ...and fix up the callers. For do_file_open_root, just declare a
      struct filename on the stack and fill out the .name field. For
      do_filp_open, make it also take a struct filename pointer, and fix up its
      callers to call it appropriately.
      
      For filp_open, add a variant that takes a struct filename pointer and turn
      filp_open into a wrapper around it.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      669abf4e
    • vfs: define struct filename and have getname() return it · 91a27b2a
      Jeff Layton authored
      getname() is intended to copy pathname strings from userspace into a
      kernel buffer. The result is just a string in kernel space. It would
      however be quite helpful to be able to attach some ancillary info to
      the string.
      
      For instance, we could attach some audit-related info to reduce the
      amount of audit-related processing needed. When auditing is enabled,
      we could also call getname() on the string more than once and not
      need to recopy it from userspace.
      
      This patchset converts the getname()/putname() interfaces to return
      a struct instead of a string. For now, the struct just tracks the
      string in kernel space and the original userland pointer for it.
      
      Later, we'll add other information to the struct as it becomes
      convenient.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      91a27b2a
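      The struct described above tracks the kernel-space copy of the path plus the
      original userland pointer.  The toy user-space sketch below mirrors that
      shape and the getname()/putname() ownership model; it is simplified (the
      in-tree definition carries __user annotations and later gained more fields),
      and strdup stands in for the kernel's copy from userspace.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct filename {
            const char *name;   /* copied string ("kernel" copy here) */
            const char *uptr;   /* original userland pointer */
        };

        static struct filename *getname(const char *user_ptr)
        {
            struct filename *fn = malloc(sizeof(*fn));
            if (!fn)
                return NULL;
            fn->name = strdup(user_ptr);
            fn->uptr = user_ptr;
            return fn;
        }

        static void putname(struct filename *fn)
        {
            free((void *)fn->name);
            free(fn);
        }

        int main(void)
        {
            struct filename *fn = getname("/etc/fstab");
            printf("%s\n", fn->name);
            putname(fn);
            return 0;
        }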
  13. 01 Aug 2012 (5 commits)
    • mm: swapfile: clean up unuse_pte race handling · 5d84c776
      Johannes Weiner authored
      The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when
      the function would continue to reestablish the page even after
      mem_cgroup_try_charge_swapin() failed.  After 85d9fc89 "memcg: fix refcnt
      handling at swapoff", the condition is always true when this code is
      reached.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d84c776
    • swapfile: avoid dereferencing bd_disk during swap_entry_free for network storage · 73744923
      Mel Gorman authored
      Commit b3a27d ("swap: Add swap slot free callback to
      block_device_operations") dereferences p->bdev->bd_disk but this is a NULL
      dereference if using swap-over-NFS.  This patch checks SWP_BLKDEV on the
      swap_info_struct before dereferencing.
      
      With reference to this callback, Christoph Hellwig stated "Please just
      remove the callback entirely.  It has no user outside the staging tree and
      was added clearly against the rules for that staging tree".  This would
      also be my preference but there was not an obvious way of keeping zram in
      staging/ happy.
      Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73744923
    • mm: swap: implement generic handler for swap_activate · a509bc1a
      Mel Gorman authored
      The version of swap_activate introduced is sufficient for swap-over-NFS
      but would not provide enough information to implement a generic handler.
      This patch shuffles things slightly to ensure the same information is
      available for aops->swap_activate() as is available to the core.
      
      No functionality change.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a509bc1a
    • mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages · 62c230bc
      Mel Gorman authored
      Currently swapfiles are managed entirely by the core VM by using ->bmap to
      allocate space and write to the blocks directly.  This effectively ensures
      that the underlying blocks are allocated and avoids the need for the swap
      subsystem to locate what physical blocks store offsets within a file.
      
      If the swap subsystem is to use the filesystem information to locate the
      blocks, it is critical that information such as block groups, block
      bitmaps and the block descriptor table that map the swap file be
      resident in memory.  This patch adds address_space_operations that the
      VM can call when activating or deactivating swap backed by a file.
      
        int swap_activate(struct file *);
        int swap_deactivate(struct file *);
      
      The ->swap_activate() method is used to communicate to the file that the
      VM relies on it, and the address_space should take adequate measures such
      as reserving space in the underlying device, reserving memory for mempools
      and pinning information such as the block descriptor table in memory.  The
      ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
      returned success.
      
      After a successful swapfile ->swap_activate, the swapfile is marked
      SWP_FILE and swapper_space.a_ops will proxy to
      sis->swap_file->f_mapping->a_ops, using ->direct_IO to write swapcache
      pages and ->readpage to read them.
      
      It is perfectly possible that direct_IO be used to read the swap pages but
      it is an unnecessary complication.  Similarly, it is possible that
      ->writepage be used instead of direct_io to write the pages but filesystem
      developers have stated that calling writepage from the VM is undesirable
      for a variety of reasons and using direct_IO opens up the possibility of
      writing back batches of swap pages in the future.
      
      [a.p.zijlstra@chello.nl: Original patch]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62c230bc
    • mm: methods for teaching filesystems about PG_swapcache pages · f981c595
      Mel Gorman authored
      In order to teach filesystems to handle swap cache pages, three new page
      functions are introduced:
      
        pgoff_t page_file_index(struct page *);
        loff_t page_file_offset(struct page *);
        struct address_space *page_file_mapping(struct page *);
      
      page_file_index() - gives the offset of this page in the file in
      PAGE_CACHE_SIZE blocks.  Like page->index is for mapped pages, this
      function also gives the correct index for PG_swapcache pages.
      
      page_file_offset() - uses page_file_index(), so that it will give the
      expected result, even for PG_swapcache pages.
      
      page_file_mapping() - gives the mapping backing the actual page; that is
      for swap cache pages it will give swap_file->f_mapping.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f981c595
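      A user-space model of how the three helpers relate for a PG_swapcache page.
      The struct page_model and its fields are stand-ins for struct page and the
      swap entry, and the offset relationship (index shifted by the page-cache
      shift) is an assumption consistent with the description above rather than
      the kernel source itself.

        #include <stdio.h>

        #define PAGE_SHIFT 12   /* 4 KiB pages, for illustration */

        struct page_model {
            int swapcache;                 /* PG_swapcache? */
            unsigned long index;           /* page->index for mapped pages */
            unsigned long swap_offset;     /* offset within the swap file */
            void *mapping;                 /* page->mapping */
            void *swap_file_mapping;       /* swap_file->f_mapping */
        };

        static unsigned long page_file_index(const struct page_model *p)
        {
            /* For swapcache pages the index comes from the swap entry,
             * not from page->index. */
            return p->swapcache ? p->swap_offset : p->index;
        }

        static unsigned long long page_file_offset(const struct page_model *p)
        {
            return (unsigned long long)page_file_index(p) << PAGE_SHIFT;
        }

        static void *page_file_mapping(const struct page_model *p)
        {
            return p->swapcache ? p->swap_file_mapping : p->mapping;
        }

        int main(void)
        {
            struct page_model p = { .swapcache = 1, .swap_offset = 42 };
            printf("index=%lu offset=%llu mapping=%p\n",
                   page_file_index(&p), page_file_offset(&p),
                   page_file_mapping(&p));
            return 0;
        }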
  14. 16 Jun 2012 (1 commit)
    • swap: fix shmem swapping when more than 8 areas · 9b15b817
      Hugh Dickins authored
      Minchan Kim reports that when a system has many swap areas, and tmpfs
      swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
      back the page cannot locate it, and the read fails with -ENOMEM.
      
      Whoops.  Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
      swp_entry_to_pte()) technique for determining maximum usable swap
      offset, without stopping to realize that that actually depends upon the
      pte swap encoding shifting swap offset to the higher bits and truncating
      it there.  Whereas our radix_tree swap encoding leaves offset in the
      lower bits: it's swap "type" (that is, index of swap area) that was
      truncated.
      
      Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
      broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().
      
      This does not reduce the usable size of a swap area any further, it
      leaves it as claimed when making the original commit: no change from 3.0
      on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
      per swapfile on i386 with PAE.  It's not a change I would have risked
      five years ago, but with x86_64 supported for ten years, I believe it's
      appropriate now.
      
      Hmm, and what if some architecture implements its swap pte with offset
      encoded below type? That would equally break the maximum usable swap
      offset check.  Happily, they all follow the same tradition of encoding
      offset above type, but I'll prepare a check on that for next.
      Reported-and-Reviewed-and-Tested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b15b817
  15. 30 May 2012 (2 commits)
    • memcg: fix/change behavior of shared anon at moving task · 4b91355e
      KAMEZAWA Hiroyuki authored
      This patch changes memcg's behavior at task_move().
      
      At task_move(), the kernel scans a task's page table and moves the
      charges for mapped pages from the source cgroup to the target cgroup.
      There has been a bug in the handling of shared anonymous pages for a
      long time.
      
      Before patch:
        - The spec says 'shared anonymous pages are not moved.'
        - The implementation was 'shared anonymous pages may be moved'.
          If page_mapcount <= 2, the shared anonymous pages' charges were moved.
      
      After patch:
        - The spec says 'all anonymous pages are moved'.
        - The implementation is 'all anonymous pages are moved'.
      
      Considering how memcg is used, this will not affect the user experience.
      'Shared anonymous' pages only exist within a tree of processes which
      don't do exec().  Moving one such process without exec() seems not sane.
      For example, libcgroup will not be affected by this change.  (Anyway, no
      one noticed the implementation's behavior for a long time...)
      
      Below is a discussion log:
      
       - current spec/implementation are complex
       - Now, shared file caches are moved
       - It adds unclear check as page_mapcount(). To do correct check,
         we should check swap users, etc.
       - No one notice this implementation behavior. So, no one get benefit
         from the design.
       - In general, once task is moved to a cgroup for running, it will not
         be moved....
       - Finally, we have control knob as memory.move_charge_at_immigrate.
      
      Here is a patch to allow moving shared pages completely.  This makes
      memcg simpler and fixes the currently broken code.
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b91355e
    • shmem: replace page if mapping excludes its zone · bde05d1c
      Hugh Dickins authored
      The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the
      backing RAM has to be below 4GB.  Not a problem while the boards
      supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
      their GMA3600 is managed by the GMA500 driver.
      
      shmem/tmpfs has never pretended to support hardware restrictions on the
      backing memory, but it might have appeared to do so before v3.1, and
      even now it works fine until a page is swapped out then back in.  When
      read_cache_page_gfp() supplied a freshly allocated page for copy, that
      compensated for whatever choice might have been made by earlier swapin
      readahead; but swapoff was likely to destroy the illusion.
      
      We'd like to continue to support GMA500, so now add a new
      shmem_should_replace_page() check on the zone when about to move a page
      from swapcache to filecache (in swapin and swapoff cases), with
      shmem_replace_page() to allocate and substitute a suitable page (given
      gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
      
      This does involve a minor extension to mem_cgroup_replace_page_cache()
      (the page may or may not have already been charged); and I've removed a
      comment and call to mem_cgroup_uncharge_cache_page(), which in fact is
      always a no-op while PageSwapCache.
      
      Also removed optimization of an unlikely path in shmem_getpage_gfp(),
      now that we need to check PageSwapCache more carefully (a racing caller
      might already have made the copy).  And at one point shmem_unuse_inode()
      needs to use the hitherto private page_swapcount(), to guard against
      racing with inode eviction.
      
      It would make sense to extend shmem_should_replace_page(), to cover
      cpuset and NUMA mempolicy restrictions too, but set that aside for now:
      needs a cleanup of shmem mempolicy handling, and more testing, and ought
      to handle swap faults in do_swap_page() as well as shmem.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Stephane Marchesin <marcheu@chromium.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Rob Clark <rob.clark@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bde05d1c
  16. 15 May 2012 (1 commit)
    • mm: frontswap: core swap subsystem hooks and headers · 38b5faf4
      Dan Magenheimer authored
      This patch, 2of4, contains the changes to the core swap subsystem.
      This includes:
      
      (1) makes available the core swap data structures (swap_lock, swap_list
      and swap_info) that are needed by frontswap.c.  We don't need to expose
      them to the dozens of files that include swap.h, so we create a new
      swapfile.h just to extern-ify these and modify their declarations to
      non-static.
      
      (2) adds frontswap-related elements to swap_info_struct.  frontswap_map
      points to vzalloc'ed one-bit-per-swap-page metadata that indicates
      whether the swap page is in frontswap or in the device, and
      frontswap_pages counts how many pages are in frontswap.
      
      (3) adds hooks in the swap subsystem and extends try_to_unuse so that
      frontswap_shrink can do a "partial swapoff".
      
      Note that a failed frontswap_map allocation is safe... failure is noted
      by lack of "FS" in the subsequent printk.
      
      ---
      
      [v14: rebase to 3.4-rc2]
      [v10: no change]
      [v9: akpm@linux-foundation.org: mark some statics __read_mostly]
      [v9: akpm@linux-foundation.org: add clarifying comments]
      [v9: akpm@linux-foundation.org: no need to loop repeating try_to_unuse]
      [v9: error27@gmail.com: remove superfluous check for NULL]
      [v8: rebase to 3.0-rc4]
      [v8: kamezawa.hiroyu@jp.fujitsu.com: change counter to atomic_t to avoid races]
      [v8: kamezawa.hiroyu@jp.fujitsu.com: comment to clarify informational counters]
      [v7: rebase to 3.0-rc3]
      [v7: JBeulich@novell.com: add new swap struct elements only if config'd]
      [v6: rebase to 3.0-rc1]
      [v6: lliubbo@gmail.com: fix null pointer deref if vzalloc fails]
      [v6: konrad.wilk@oracl.com: various checks and code clarifications/comments]
      [v5: no change from v4]
      [v4: rebase to 2.6.39]
      Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Jan Beulich <JBeulich@novell.com>
      Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Rik Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      [v11: Rebased, fixed mm/swapfile.c context change]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      38b5faf4
  17. 29 Mar 2012 (1 commit)
  18. 22 Mar 2012 (3 commits)
    • swap: don't do discard if no discard option added · 052b1987
      Shaohua Li authored
      When swapon() was not passed the SWAP_FLAG_DISCARD option, sys_swapon()
      will still perform a discard operation.  This can cause problems if
      discard is slow or buggy.
      
      Reverse the order of the check so that a discard operation is performed
      only if the sys_swapon() caller is attempting to enable discard.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
      Tested-by: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      052b1987
    • mm: make swapin readahead skip over holes · 67f96aa2
      Rik van Riel authored
      Ever since abandoning the virtual scan of processes, for scalability
      reasons, swap space has been a little more fragmented than before.  This
      can lead to the situation where a large memory user is killed, swap space
      ends up full of "holes" and swapin readahead is totally ineffective.
      
      On my home system, after killing a leaky firefox it took over an hour to
      page just under 2GB of memory back in, slowing the virtual machines down
      to a crawl.
      
      This patch makes swapin readahead simply skip over holes, instead of
      stopping at them.  This allows the system to swap things back in at rates
      of several MB/second, instead of a few hundred kB/second.
      
      The checks done in valid_swaphandles are already done in
      read_swap_cache_async as well, allowing us to remove a fair amount of
      code.
      
      [akpm@linux-foundation.org: fix it for page_cluster >= 32]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Adrian Drzewiecki <z@drze.net>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67f96aa2
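      A minimal user-space model of the "skip over holes" idea: instead of
      stopping the readahead window at the first unused slot, skip it and keep
      going.  swap_map[i] == 0 models a hole (free slot); names, the window size
      and the toy data are illustrative, not the kernel's.

        #include <stdio.h>

        #define READAHEAD_WINDOW 8

        static unsigned char swap_map[32] = {
            1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
        };

        static void read_swap_page(unsigned long offset)
        {
            printf("readahead offset %lu\n", offset);
        }

        static void swapin_readahead_model(unsigned long target)
        {
            unsigned long start = target & ~(unsigned long)(READAHEAD_WINDOW - 1);

            for (unsigned long off = start; off < start + READAHEAD_WINDOW; off++) {
                if (!swap_map[off])
                    continue;       /* hole: skip it, don't stop the window */
                read_swap_page(off);
            }
        }

        int main(void)
        {
            swapin_readahead_model(3);
            return 0;
        }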
    • mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Andrea Arcangeli authored
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem held in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad(), and that can trigger a
      false positive from pmd_bad(), which does not like to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem: khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent page tables from going away from under them; during code
      review it seems vm86 mode on 32-bit kernels requires that too, unless
      it's restricted to 1 thread per process or UP builds).  The race is only
      with the huge page faults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example if the
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped, if the page fault runs first it
      will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be enough
      to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd become a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd in the local stack
      (regardless of what we read, no need of actual CPU barriers, only
      compiler barrier needed), and be sure it is not changing under the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that there is no way anymore that in a code
      path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
      can run into a hugepmd.  The overhead of a barrier() is just a compiler
      tweak and should not be measurable (I only added it for THP builds).  I
      don't exclude different compiler versions may have prevented the race
      too by caching the value of *pmd on the stack (that hasn't been
      verified, but it wouldn't be impossible considering
      pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
      and there's no external function called in between pmd_trans_huge and
      pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: Larry Woodman <lwoodman@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  19. 20 Mar, 2012 1 commit
  20. 14 Feb, 2012 1 commit
  21. 13 Jan, 2012 1 commit
  22. 11 Jan, 2012 1 commit
    • M
      mm: avoid livelock on !__GFP_FS allocations · f90ac398
      Committed by Mel Gorman
      Colin Cross reported:
      
        Under the following conditions, __alloc_pages_slowpath can loop forever:
        gfp_mask & __GFP_WAIT is true
        gfp_mask & __GFP_FS is false
        reclaim and compaction make no progress
        order <= PAGE_ALLOC_COSTLY_ORDER
      
        These conditions happen very often during suspend and resume,
        when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
        allocations into __GFP_WAIT.
      
        The oom killer is not run because gfp_mask & __GFP_FS is false,
        but should_alloc_retry will always return true when order is less
        than PAGE_ALLOC_COSTLY_ORDER.
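
      To make the conditions above concrete, here is a greatly simplified
      sketch of the state the allocator keeps reaching (the variables
      can_invoke_oom and should_retry are illustrative only; this is not the
      actual mm/page_alloc.c code):

      	/*
      	 * Inside the __alloc_pages_slowpath() retry loop, after reclaim
      	 * and compaction made no progress (did_some_progress == 0):
      	 */
      	bool can_invoke_oom = gfp_mask & __GFP_FS;	/* false: no OOM kill */
      	bool should_retry   = !(gfp_mask & __GFP_NORETRY) &&
      			      order <= PAGE_ALLOC_COSTLY_ORDER;	/* true */

      	if (!did_some_progress && !can_invoke_oom && should_retry)
      		goto rebalance;	/* back to reclaim, which again makes no
      				 * progress -> the loop never terminates */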
      
      In his fix, he avoided retrying the allocation if reclaim made no
      progress and __GFP_FS was not set.  The problem is that this would
      result in GFP_NOIO allocations that previously succeeded now failing,
      which would be very unfortunate.
      
      The big difference between GFP_NOIO and suspend converting GFP_KERNEL
      to behave like GFP_NOIO is that, normally, flushers will be cleaning
      pages and kswapd will be reclaiming pages, allowing GFP_NOIO to succeed
      after a short delay.  The same does not necessarily apply during
      suspend, as the storage device may itself be suspended.
      
      This patch special-cases suspend so that the page allocation fails if
      reclaim cannot make progress, and it adds some documentation on how
      gfp_allowed_mask is currently used.  Failing allocations like this may
      cause suspend to abort, but that is better than a livelock.
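
      A rough sketch of that special case (assuming a helper along the lines
      of pm_suspended_storage(), which reports whether suspend has stripped
      __GFP_IO/__GFP_FS from gfp_allowed_mask; illustrative, not the literal
      patch):

      	static inline bool pm_suspended_storage(void)
      	{
      		/* Suspend cleared IO/FS from the allowed mask: storage is off. */
      		if ((gfp_allowed_mask & (__GFP_IO | __GFP_FS)) ==
      		    (__GFP_IO | __GFP_FS))
      			return false;
      		return true;
      	}

      	/* ...and in the retry decision: bail out instead of looping. */
      	if (!did_some_progress && pm_suspended_storage())
      		return 0;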
      
      [mgorman@suse.de: Rework fix to be suspend specific]
      [rientjes@google.com: Move suspended device check to should_alloc_retry]
      Reported-by: Colin Cross <ccross@android.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f90ac398