1. 13 9月, 2013 5 次提交
    • J
      mm: memcg: enable memcg OOM killer only for user faults · 519e5247
      Johannes Weiner 提交于
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      519e5247
    • J
      arch: mm: pass userspace fault flag to generic fault handler · 759496ba
      Johannes Weiner 提交于
      Unlike global OOM handling, memory cgroup code will invoke the OOM killer
      in any OOM situation because it has no way of telling faults occuring in
      kernel context - which could be handled more gracefully - from
      user-triggered faults.
      
      Pass a flag that identifies faults originating in user space from the
      architecture-specific fault handlers to generic code so that memcg OOM
      handling can be improved.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      759496ba
    • M
      memcg: enhance memcg iterator to support predicates · de57780d
      Michal Hocko 提交于
      The caller of the iterator might know that some nodes or even subtrees
      should be skipped but there is no way to tell iterators about that so the
      only choice left is to let iterators to visit each node and do the
      selection outside of the iterating code.  This, however, doesn't scale
      well with hierarchies with many groups where only few groups are
      interesting.
      
      This patch adds mem_cgroup_iter_cond variant of the iterator with a
      callback which gets called for every visited node.  There are three
      possible ways how the callback can influence the walk.  Either the node is
      visited, it is skipped but the tree walk continues down the tree or the
      whole subtree of the current group is skipped.
      
      [hughd@google.com: fix memcg-less page reclaim]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de57780d
    • M
      vmscan, memcg: do softlimit reclaim also for targeted reclaim · a5b7c87f
      Michal Hocko 提交于
      Soft reclaim has been done only for the global reclaim (both background
      and direct).  Since "memcg: integrate soft reclaim tighter with zone
      shrinking code" there is no reason for this limitation anymore as the soft
      limit reclaim doesn't use any special code paths and it is a part of the
      zone shrinking code which is used by both global and targeted reclaims.
      
      From the semantic point of view it is natural to consider soft limit
      before touching all groups in the hierarchy tree which is touching the
      hard limit because soft limit tells us where to push back when there is a
      memory pressure.  It is not important whether the pressure comes from the
      limit or imbalanced zones.
      
      This patch simply enables soft reclaim unconditionally in
      mem_cgroup_should_soft_reclaim so it is enabled for both global and
      targeted reclaim paths.  mem_cgroup_soft_reclaim_eligible needs to learn
      about the root of the reclaim to know where to stop checking soft limit
      state of parents up the hierarchy.  Say we have
      
      A (over soft limit)
       \
        B (below s.l., hit the hard limit)
       / \
      C   D (below s.l.)
      
      B is the source of the outside memory pressure now for D but we shouldn't
      soft reclaim it because it is behaving well under B subtree and we can
      still reclaim from C (pressumably it is over the limit).
      mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
      hierarchy at B (root of the memory pressure).
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5b7c87f
    • M
      memcg, vmscan: integrate soft reclaim tighter with zone shrinking code · 3b38722e
      Michal Hocko 提交于
      This patchset is sitting out of tree for quite some time without any
      objections.  I would be really happy if it made it into 3.12.  I do not
      want to push it too hard but I think this work is basically ready and
      waiting more doesn't help.
      
      The basic idea is quite simple.  Pull soft reclaim into shrink_zone in the
      first step and get rid of the previous soft reclaim infrastructure.
      shrink_zone is done in two passes now.  First it tries to do the soft
      limit reclaim and it falls back to reclaim-all mode if no group is over
      the limit or no pages have been scanned.  The second pass happens at the
      same priority so the only time we waste is the memcg tree walk which has
      been updated in the third step to have only negligible overhead.
      
      As a bonus we will get rid of a _lot_ of code by this and soft reclaim
      will not stand out like before when it wasn't integrated into the zone
      shrinking code and it reclaimed at priority 0 (the testing results show
      that some workloads suffers from such an aggressive reclaim).  The clean
      up is in a separate patch because I felt it would be easier to review that
      way.
      
      The second step is soft limit reclaim integration into targeted reclaim.
      It should be rather straight forward.  Soft limit has been used only for
      the global reclaim so far but it makes sense for any kind of pressure
      coming from up-the-hierarchy, including targeted reclaim.
      
      The third step (patches 4-8) addresses the tree walk overhead by enhancing
      memcg iterators to enable skipping whole subtrees and tracking number of
      over soft limit children at each level of the hierarchy.  This information
      is updated same way the old soft limit tree was updated (from
      memcg_check_events) so we shouldn't see an additional overhead.  In fact
      mem_cgroup_update_soft_limit is much simpler than tree manipulation done
      previously.
      
      __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
      mem_cgroup_iter so the decision whether a particular group should be
      visited is done at the iterator level which allows us to decide to skip
      the whole subtree as well (if there is no child in excess).  This reduces
      the tree walk overhead considerably.
      
      * TEST 1
      ========
      
      My primary test case was a parallel kernel build with 2 groups (make is
      running with -j8 with a distribution .config in a separate cgroup without
      any hard limit) on a 32 CPU machine booted with 1GB memory and both builds
      run taskset to Node 0 cpus.
      
      I was mostly interested in 2 setups.  Default - no soft limit set and -
      and 0 soft limit set to both groups.  The first one should tell us whether
      the rework regresses the default behavior while the second one should show
      us improvements in an extreme case where both workloads are always over
      the soft limit.
      
      /usr/bin/time -v has been used to collect the statistics and each
      configuration had 3 runs after fresh boot without any other load on the
      system.
      
      base is mmotm-2013-07-18-16-40
      rework all 8 patches applied on top of base
      
      * No-limit
      User
      no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
      no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
      System
      no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
      no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
      Elapsed
      no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
      no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6
      
      The results are within noise. Elapsed time has a bigger variance but the
      average looks good.
      
      * 0-limit
      User
      0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
      0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
      System
      0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
      0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
      Elapsed
      0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
      0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6
      
      The improvement is really huge here (even bigger than with my previous
      testing and I suspect that this highly depends on the storage).  Page
      fault statistics tell us at least part of the story:
      
      Minor
      0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
      0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
      Major
      0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
      0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6
      
      Same as with my previous testing Minor faults are more or less within
      noise but Major fault count is way bellow the base kernel.
      
      While this looks as a nice win it is fair to say that 0-limit
      configuration is quite artificial. So I was playing with 0-no-limit
      loads as well.
      
      * TEST 2
      ========
      
      The following results are from 2 groups configuration on a 16GB machine
      (single NUMA node).
      
      - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with
        2*TotalMem with 0 soft limit.
      - B running a mem_eater which consumes TotalMem-1G without any limit. The
        mem_eater consumes the memory in 100 chunks with 1s nap after each
        mmap+poppulate so that both loads have chance to fight for the memory.
      
      The expected result is that B shouldn't be reclaimed and A shouldn't see
      a big dropdown in elapsed time.
      
      User
      base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
      rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
      System
      base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
      rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
      Elapsed
      base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
      rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3
      
      System time improved slightly as well as Elapsed. My previous testing
      has shown worse numbers but this again seem to depend on the storage
      speed.
      
      My theory is that the writeback doesn't catch up and prio-0 soft reclaim
      falls into wait on writeback page too often in the base kernel. The
      patched kernel doesn't do that because the soft reclaim is done from the
      kswapd/direct reclaim context. This can be seen on the following graph
      nicely. The A's group usage_in_bytes regurarly drops really low very often.
      
      All 3 runs
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
      resp. a detail of the single run
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png
      
      mem_eater seems to be doing better as well. It gets to the full
      allocation size faster as can be seen on the following graph:
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png
      
      /proc/meminfo collected during the test also shows that rework kernel
      hasn't swapped that much (well almost not at all):
      base: max: 123900 K avg: 56388.29 K
      rework: max: 300 K avg: 128.68 K
      
      kswapd and direct reclaim statistics are of no use unfortunatelly because
      soft reclaim is not accounted properly as the counters are hidden by
      global_reclaim() checks in the base kernel.
      
      * TEST 3
      ========
      
      Another test was the same configuration as TEST2 except the stream IO was
      replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as
      in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB
      left.
      
      Kbuild did better with the rework kernel here as well:
      User
      base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
      rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
      System
      base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
      rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
      Elapsed
      base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
      rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
      Minor
      base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
      rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
      Major
      base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
      rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3
      
      Again we can see a significant improvement in Elapsed (it also seems to
      be more stable), there is a huge dropdown for the Major page faults and
      much more swapping:
      base: max: 583736 K avg: 112547.43 K
      rework: max: 4012 K avg: 124.36 K
      
      Graphs from all three runs show the variability of the kbuild quite
      nicely.  It even seems that it took longer after every run with the base
      kernel which would be quite surprising as the source tree for the build is
      removed and caches are dropped after each run so the build operates on a
      freshly extracted sources everytime.
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png
      
      My other testing shows that this is just a matter of timing and other runs
      behave differently the std for Elapsed time is similar ~50.  Example of
      other three runs:
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png
      
      So to wrap this up.  The series is still doing good and improves the soft
      limit.
      
      The testing results for bunch of cgroups with both stream IO and kbuild
      loads can be found in "memcg: track children in soft limit excess to
      improve soft limit".
      
      This patch:
      
      Memcg soft reclaim has been traditionally triggered from the global
      reclaim paths before calling shrink_zone.  mem_cgroup_soft_limit_reclaim
      then picked up a group which exceeds the soft limit the most and reclaimed
      it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
      
      The infrastructure requires per-node-zone trees which hold over-limit
      groups and keep them up-to-date (via memcg_check_events) which is not cost
      free.  Although this overhead hasn't turned out to be a bottle neck the
      implementation is suboptimal because mem_cgroup_update_tree has no idea
      which zones consumed memory over the limit so we could easily end up
      having a group on a node-zone tree having only few pages from that
      node-zone.
      
      This patch doesn't try to fix node-zone trees management because it seems
      that integrating soft reclaim into zone shrinking sounds much easier and
      more appropriate for several reasons.  First of all 0 priority reclaim was
      a crude hack which might lead to big stalls if the group's LRUs are big
      and hard to reclaim (e.g.  a lot of dirty/writeback pages).  Soft reclaim
      should be applicable also to the targeted reclaim which is awkward right
      now without additional hacks.  Last but not least the whole infrastructure
      eats quite some code.
      
      After this patch shrink_zone is done in 2 passes.  First it tries to do
      the soft reclaim if appropriate (only for global reclaim for now to keep
      compatible with the original state) and fall back to ignoring soft limit
      if no group is eligible to soft reclaim or nothing has been scanned during
      the first pass.  Only groups which are over their soft limit or any of
      their parents up the hierarchy is over the limit are considered eligible
      during the first pass.
      
      Soft limit tree which is not necessary anymore will be removed in the
      follow up patch to make this patch smaller and easier to review.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b38722e
  2. 12 9月, 2013 35 次提交
    • S
      lz4: fix compression/decompression signedness mismatch · b34081f1
      Sergey Senozhatsky 提交于
      LZ4 compression and decompression functions require different in
      signedness input/output parameters: unsigned char for compression and
      signed char for decompression.
      
      Change decompression API to require "(const) unsigned char *".
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Kyungsik Lee <kyungsik.lee@lge.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Yann Collet <yann.collet.73@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b34081f1
    • D
      ipc: rename ids->rw_mutex · d9a605e4
      Davidlohr Bueso 提交于
      Since in some situations the lock can be shared for readers, we shouldn't
      be calling it a mutex, rename it to rwsem.
      Signed-off-by: NDavidlohr Bueso <davidlohr.bueso@hp.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9a605e4
    • R
      initmpfs: move rootfs code from fs/ramfs/ to init/ · 57f150a5
      Rob Landley 提交于
      When the rootfs code was a wrapper around ramfs, having them in the same
      file made sense.  Now that it can wrap another filesystem type, move it in
      with the init code instead.
      
      This also allows a subsequent patch to access rootfstype= command line
      arg.
      Signed-off-by: NRob Landley <rob@landley.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Stephen Warren <swarren@nvidia.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jim Cromie <jim.cromie@gmail.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57f150a5
    • J
      lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt · 5e4c0d97
      Jan Kara 提交于
      With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
      one such possible user), the following race can happen:
      
      radix_tree_preload()
      ...
      radix_tree_insert()
        radix_tree_node_alloc()
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
      <interrupt>
      ...
      radix_tree_preload()
      ...
      radix_tree_insert()
        radix_tree_node_alloc()
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
      
      And we give out one radix tree node twice.  That clearly results in radix
      tree corruption with different results (usually OOPS) depending on which
      two users of radix tree race.
      
      We fix the problem by making radix_tree_node_alloc() always allocate fresh
      radix tree nodes when in interrupt.  Using preloading when in interrupt
      doesn't make sense since all the allocations have to be atomic anyway and
      we cannot steal nodes from process-context users because some users rely
      on radix_tree_insert() succeeding after radix_tree_preload().
      in_interrupt() check is somewhat ugly but we cannot simply key off passed
      gfp_mask as that is acquired from root_gfp_mask() and thus the same for
      all preload users.
      
      Another part of the fix is to avoid node preallocation in
      radix_tree_preload() when passed gfp_mask doesn't allow waiting.  Again,
      preallocation in such case doesn't make sense and when preallocation would
      happen in interrupt we could possibly leak some allocated nodes.  However,
      some users of radix_tree_preload() require following radix_tree_insert()
      to succeed.  To avoid unexpected effects for these users,
      radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
      and we provide a new function radix_tree_maybe_preload() for those users
      which get different gfp mask from different call sites and which are
      prepared to handle radix_tree_insert() failure.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e4c0d97
    • C
      rbtree: add rbtree_postorder_for_each_entry_safe() helper · 2b529089
      Cody P Schafer 提交于
      Because deletion (of the entire tree) is a relatively common use of the
      rbtree_postorder iteration, and because doing it safely means fiddling
      with temporary storage, provide a helper to simplify postorder rbtree
      iteration.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Reviewed-by: NSeth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: David Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b529089
    • C
      rbtree: add postorder iteration functions · 9dee5c51
      Cody P Schafer 提交于
      Postorder iteration yields all of a node's children prior to yielding the
      node itself, and this particular implementation also avoids examining the
      leaf links in a node after that node has been yielded.
      
      In what I expect will be its most common usage, postorder iteration allows
      the deletion of every node in an rbtree without modifying the rbtree nodes
      (no _requirement_ that they be nulled) while avoiding referencing child
      nodes after they have been "deleted" (most commonly, freed).
      
      I have only updated zswap to use this functionality at this point, but
      numerous bits of code (most notably in the filesystem drivers) use a hand
      rolled postorder iteration that NULLs child links as it traverses the
      tree.  Each of those instances could be replaced with this common
      implementation.
      
      1 & 2 add rbtree postorder iteration functions.
      3 adds testing of the iteration to the rbtree runtime tests
      4 allows building the rbtree runtime tests as builtins
      5 updates zswap.
      
      This patch:
      
      Add postorder iteration functions for rbtree.  These are useful for safely
      freeing an entire rbtree without modifying the tree at all.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Reviewed-by: NSeth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: David Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9dee5c51
    • M
      vmcore: introduce remap_oldmem_pfn_range() · 9cb21813
      Michael Holzheu 提交于
      For zfcpdump we can't map the HSA storage because it is only available via
      a read interface.  Therefore, for the new vmcore mmap feature we have
      introduce a new mechanism to create mappings on demand.
      
      This patch introduces a new architecture function remap_oldmem_pfn_range()
      that should be used to create mappings with remap_pfn_range() for oldmem
      areas that can be directly mapped.  For zfcpdump this is everything
      besides of the HSA memory.  For the areas that are not mapped by
      remap_oldmem_pfn_range() a generic vmcore a new generic vmcore fault
      handler mmap_vmcore_fault() is called.
      
      This handler works as follows:
      
      * Get already available or new page from page cache (find_or_create_page)
      * Check if /proc/vmcore page is filled with data (PageUptodate)
      * If yes:
        Return that page
      * If no:
        Fill page using __vmcore_read(), set PageUptodate, and return page
      Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Jan Willeke <willeke@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cb21813
    • M
      vmcore: introduce ELF header in new memory feature · be8a8d06
      Michael Holzheu 提交于
      For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
      (zfcpdump).  We have support where the first HSA_SIZE bytes are saved into
      a hypervisor owned memory area (HSA) before the kdump kernel is booted.
      When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.
      
      The advantages of this mechanism are:
      
       * No crashkernel memory has to be defined in the old kernel.
       * Early boot problems (before kexec_load has been done) can be dumped
       * Non-Linux systems can be dumped.
      
      We modify the s390 copy_oldmem_page() function to read from the HSA memory
      if memory below HSA_SIZE bytes is requested.
      
      Since we cannot use the kexec tool to load the kernel in this scenario,
      we have to build the ELF header in the 2nd (kdump/new) kernel.
      
      So with the following patch set we would like to introduce the new
      function that the ELF header for /proc/vmcore can be created in the 2nd
      kernel memory.
      
      The following steps are done during zfcpdump execution:
      
      1.  Production system crashes
      2.  User boots a SCSI disk that has been prepared with the zfcpdump tool
      3.  Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
      4.  Boot loader loads kernel into low memory area
      5.  Kernel boots and uses only HSA_SIZE bytes of memory
      6.  Kernel saves registers of non-boot CPUs
      7.  Kernel does memory detection for dump memory map
      8.  Kernel creates ELF header for /proc/vmcore
      9.  /proc/vmcore uses this header for initialization
      10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
          - copy_oldmem_page() copies from HSA for memory below HSA_SIZE
          - copy_oldmem_page() copies from real memory for memory above HSA_SIZE
      
      Currently for s390 we create the ELF core header in the 2nd kernel with a
      small trick.  We relocate the addresses in the ELF header in a way that
      for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
      and the read_from_oldmem() returns the correct data.  This allows the
      /proc/vmcore code to use the ELF header in the 2nd kernel.
      
      This patch:
      
      Exchange the old mechanism with the new and much cleaner function call
      override feature that now offcially allows to create the ELF core header
      in the 2nd kernel.
      
      To use the new feature the following function have to be defined
      by the architecture backend code to read from new memory:
      
       * elfcorehdr_alloc: Allocate ELF header
       * elfcorehdr_free: Free the memory of the ELF header
       * elfcorehdr_read: Read from ELF header
       * elfcorehdr_read_notes: Read from ELF notes
      Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Jan Willeke <willeke@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be8a8d06
    • O
      exec: kill "int depth" in search_binary_handler() · 131b2f9f
      Oleg Nesterov 提交于
      Nobody except search_binary_handler() should touch ->recursion_depth, "int
      depth" buys nothing but complicates the code, kill it.
      
      Probably we should also kill "fn" and the !NULL check, ->load_binary
      should be always defined.  And it can not go away after read_unlock() or
      this code is buggy anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      131b2f9f
    • H
      kprobes: allow to specify custom allocator for insn caches · af96397d
      Heiko Carstens 提交于
      The current two insn slot caches both use module_alloc/module_free to
      allocate and free insn slot cache pages.
      
      For s390 this is not sufficient since there is the need to allocate insn
      slots that are either within the vmalloc module area or within dma memory.
      
      Therefore add a mechanism which allows to specify an own allocator for an
      own insn slot cache.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af96397d
    • H
      kprobes: unify insn caches · c802d64a
      Heiko Carstens 提交于
      The current kpropes insn caches allocate memory areas for insn slots
      with module_alloc().  The assumption is that the kernel image and module
      area are both within the same +/- 2GB memory area.
      
      This however is not true for s390 where the kernel image resides within
      the first 2GB (DMA memory area), but the module area is far away in the
      vmalloc area, usually somewhere close below the 4TB area.
      
      For new pc relative instructions s390 needs insn slots that are within
      +/- 2GB of each area.  That way we can patch displacements of
      pc-relative instructions within the insn slots just like x86 and
      powerpc.
      
      The module area works already with the normal insn slot allocator,
      however there is currently no way to get insn slots that are within the
      first 2GB on s390 (aka DMA area).
      
      Therefore this patch set modifies the kprobes insn slot cache code in
      order to allow to specify a custom allocator for the insn slot cache
      pages.  In addition architecure can now have private insn slot caches
      withhout the need to modify common code.
      
      Patch 1 unifies and simplifies the current insn and optinsn caches
              implementation. This is a preparation which allows to add more
              insn caches in a simple way.
      
      Patch 2 adds the possibility to specify a custom allocator.
      
      Patch 3 makes s390 use the new insn slot mechanisms and adds support for
              pc-relative instructions with long displacements.
      
      This patch (of 3):
      
      The two insn caches (insn, and optinsn) each have an own mutex and
      alloc/free functions (get_[opt]insn_slot() / free_[opt]insn_slot()).
      
      Since there is the need for yet another insn cache which satifies dma
      allocations on s390, unify and simplify the current implementation:
      
      - Move the per insn cache mutex into struct kprobe_insn_cache.
      - Move the alloc/free functions to kprobe.h so they are simply
        wrappers for the generic __get_insn_slot/__free_insn_slot functions.
        The implementation is done with a DEFINE_INSN_CACHE_OPS() macro
        which provides the alloc/free functions for each cache if needed.
      - move the struct kprobe_insn_cache to kprobe.h which allows to generate
        architecture specific insn slot caches outside of the core kprobes
        code.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c802d64a
    • S
      syscalls.h: add forward declarations for inplace syscall wrappers · f9597f24
      Sergei Trofimovich 提交于
      Unclutter -Wmissing-prototypes warning types (enabled at make W=1)
      
          linux/include/linux/syscalls.h:190:18: warning: no previous prototype for 'SyS_semctl' [-Wmissing-prototypes]
            asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
                            ^
          linux/include/linux/syscalls.h:183:2: note: in expansion of macro '__SYSCALL_DEFINEx'
            __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
            ^
      by adding forward declarations right before definitions.
      Signed-off-by: NSergei Trofimovich <slyfox@gentoo.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9597f24
    • D
      smp.h: move !SMP version of on_each_cpu() out-of-line · bff2dc42
      David Daney 提交于
      All of the other non-trivial !SMP versions of functions in smp.h are
      out-of-line in up.c.  Move on_each_cpu() there as well.
      
      This allows us to get rid of the #include <linux/irqflags.h>.  The
      drawback is that this makes both the x86_64 and i386 defconfig !SMP
      kernels about 200 bytes larger each.
      Signed-off-by: NDavid Daney <david.daney@cavium.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bff2dc42
    • D
      smp: quit unconditionally enabling irq in on_each_cpu_mask and on_each_cpu_cond · fa688207
      David Daney 提交于
      As in commit f21afc25 ("smp.h: Use local_irq_{save,restore}() in
      !SMP version of on_each_cpu()"), we don't want to enable irqs if they
      are not already enabled.  There are currently no known problematical
      callers of these functions, but since it is a known failure pattern, we
      preemptively fix them.
      
      Since they are not trivial functions, make them non-inline by moving
      them to up.c.  This also makes it so we don't have to fix #include
      dependancies for preempt_{disable,enable}.
      Signed-off-by: NDavid Daney <david.daney@cavium.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa688207
    • W
      mm/hwpoison: don't need to hold compound lock for hugetlbfs page · f9121153
      Wanpeng Li 提交于
      compound lock is introduced by commit e9da73d6("thp: compound_lock."), it
      is used to serialize put_page against __split_huge_page_refcount().  In
      addition, transparent hugepages will be splitted in hwpoison handler and
      just one subpage will be poisoned.  There is unnecessary to hold compound
      lock for hugetlbfs page.  This patch replace compound_trans_order by
      compond_order in the place where the page is hugetlbfs page.
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9121153
    • M
      mm/page-writeback.c: add strictlimit feature · 5a537485
      Maxim Patlasov 提交于
      The feature prevents mistrusted filesystems (ie: FUSE mounts created by
      unprivileged users) to grow a large number of dirty pages before
      throttling.  For such filesystems balance_dirty_pages always check bdi
      counters against bdi limits.  I.e.  even if global "nr_dirty" is under
      "freerun", it's not allowed to skip bdi checks.  The only use case for now
      is fuse: it sets bdi max_ratio to 1% by default and system administrators
      are supposed to expect that this limit won't be exceeded.
      
      The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
      filesystem may set the flag when it initializes its BDI.
      
      The problematic scenario comes from the fact that nobody pays attention to
      the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
      writeback).  The implementation of fuse writeback releases original page
      (by calling end_page_writeback) almost immediately.  A fuse request queued
      for real processing bears a copy of original page.  Hence, if userspace
      fuse daemon doesn't finalize write requests in timely manner, an
      aggressive mmap writer can pollute virtually all memory by those temporary
      fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
      nobody cares.
      
      To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
      problem" as a shortcut for "a possibility of uncontrolled grow of amount
      of RAM consumed by temporary pages allocated by kernel fuse to process
      writeback".
      
      The problem was very easy to reproduce.  There is a trivial example
      filesystem implementation in fuse userspace distribution: fusexmp_fh.c.  I
      added "sleep(1);" to the write methods, then recompiled and mounted it.
      Then created a huge file on the mount point and run a simple program which
      mmap-ed the file to a memory region, then wrote a data to the region.  An
      hour later I observed almost all RAM consumed by fuse writeback.  Since
      then some unrelated changes in kernel fuse made it more difficult to
      reproduce, but it is still possible now.
      
      Putting this theoretical happens-in-the-lab thing aside, there is another
      thing that really hurts real world (FUSE) users.  This is write-through
      page cache policy FUSE currently uses.  I.e.  handling write(2), kernel
      fuse populates page cache and flushes user data to the server
      synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
      ("writeback cache policy") solve the problem, but they also make resolving
      NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
      a huge file to a fuse mount would result in memory starvation.  Miklos,
      the maintainer of FUSE, believes strictlimit feature the way to go.
      
      And eventually putting FUSE topics aside, there is one more use-case for
      strictlimit feature.  Using a slow USB stick (mass storage) in a machine
      with huge amount of RAM installed is a well-known pain.  Let's make simple
      computations.  Assuming 64GB of RAM installed, existing implementation of
      balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
      dirty (freerun == 15% of total RAM).  So, the command "cp 9GB_file
      /media/my-usb-storage/" may return in a few seconds, but subsequent
      "umount /media/my-usb-storage/" will take more than two hours if effective
      throughput of the storage is, to say, 1MB/sec.
      
      After inclusion of strictlimit feature, it will be trivial to add a knob
      (e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand.
      Manually or via udev rule.  May be I'm wrong, but it seems to be quite a
      natural desire to limit the amount of dirty memory for some devices we are
      not fully trust (in the sense of sustainable throughput).
      
      [akpm@linux-foundation.org: fix warning in page-writeback.c]
      Signed-off-by: NMaxim Patlasov <MPatlasov@parallels.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a537485
    • W
      mm/writeback: make writeback_inodes_wb static · 7d9f073b
      Wanpeng Li 提交于
      It's not used globally and could be static.
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d9f073b
    • L
      mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Lisa Du 提交于
      This patch is based on KOSAKI's work and I add a little more description,
      please refer https://lkml.org/lkml/2012/6/14/74.
      
      Currently, I found system can enter a state that there are lots of free
      pages in a zone but only order-0 and order-1 pages which means the zone is
      heavily fragmented, then high order allocation could make direct reclaim
      path's long stall(ex, 60 seconds) especially in no swap and no compaciton
      enviroment.  This problem happened on v3.4, but it seems issue still lives
      in current tree, the reason is do_try_to_free_pages enter live lock:
      
      kswapd will go to sleep if the zones have been fully scanned and are still
      not balanced.  As kswapd thinks there's little point trying all over again
      to avoid infinite loop.  Instead it changes order from high-order to
      0-order because kswapd think order-0 is the most important.  Look at
      73ce02e9 in detail.  If watermarks are ok, kswapd will go back to sleep
      and may leave zone->all_unreclaimable =3D 0.  It assume high-order users
      can still perform direct reclaim if they wish.
      
      Direct reclaim continue to reclaim for a high order which is not a
      COSTLY_ORDER without oom-killer until kswapd turn on
      zone->all_unreclaimble= .  This is because to avoid too early oom-kill.
      So it means direct_reclaim depends on kswapd to break this loop.
      
      In worst case, direct-reclaim may continue to page reclaim forever when
      kswapd sleeps forever until someone like watchdog detect and finally kill
      the process.  As described in:
      http://thread.gmane.org/gmane.linux.kernel.mm/103737
      
      We can't turn on zone->all_unreclaimable from direct reclaim path because
      direct reclaim path don't take any lock and this way is racy.  Thus this
      patch removes zone->all_unreclaimable field completely and recalculates
      zone reclaimable state every time.
      
      Note: we can't take the idea that direct-reclaim see zone->pages_scanned
      directly and kswapd continue to use zone->all_unreclaimable.  Because, it
      is racy.  commit 929bea7c (vmscan: all_unreclaimable() use
      zone->all_unreclaimable as a name) describes the detail.
      
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NLisa Du <cldu@marvell.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e543d57
    • V
      mm: munlock: manual pte walk in fast path instead of follow_page_mask() · 7a8010cd
      Vlastimil Babka 提交于
      Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
      each individual struct page.  This entails repeated full page table
      translations and page table lock taken for each page separately.
      
      This patch avoids the costly follow_page_mask() where possible, by
      iterating over ptes within single pmd under single page table lock.  The
      first pte is obtained by get_locked_pte() for non-THP page acquired by the
      initial follow_page_mask().  The rest of the on-stack pagevec for munlock
      is filled up using pte_walk as long as pte_present() and vm_normal_page()
      are sufficient to obtain the struct page.
      
      After this patch, a 14% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jörn Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a8010cd
    • C
      mm: track vma changes with VM_SOFTDIRTY bit · d9104d1c
      Cyrill Gorcunov 提交于
      Pavel reported that in case if vma area get unmapped and then mapped (or
      expanded) in-place, the soft dirty tracker won't be able to recognize this
      situation since it works on pte level and ptes are get zapped on unmap,
      loosing soft dirty bit of course.
      
      So to resolve this situation we need to track actions on vma level, there
      VM_SOFTDIRTY flag comes in.  When new vma area created (or old expanded)
      we set this bit, and keep it here until application calls for clearing
      soft dirty bit.
      
      Thus when user space application track memory changes now it can detect if
      vma area is renewed.
      Reported-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rob Landley <rob@landley.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9104d1c
    • Y
      memblock, numa: binary search node id · e76b63f8
      Yinghai Lu 提交于
      Current early_pfn_to_nid() on arch that support memblock go over
      memblock.memory one by one, so will take too many try near the end.
      
      We can use existing memblock_search to find the node id for given pfn,
      that could save some time on bigger system that have many entries
      memblock.memory array.
      
      Here are the timing differences for several machines.  In each case with
      the patch less time was spent in __early_pfn_to_nid().
      
                              3.11-rc5        with patch      difference (%)
                              --------        ----------      --------------
      UV1: 256 nodes  9TB:     411.66          402.47         -9.19 (2.23%)
      UV2: 255 nodes 16TB:    1141.02         1138.12         -2.90 (0.25%)
      UV2:  64 nodes  2TB:     128.15          126.53         -1.62 (1.26%)
      UV2:  32 nodes  2TB:     121.87          121.07         -0.80 (0.66%)
                              Time in seconds.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NRuss Anderson <rja@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e76b63f8
    • N
      mm: migrate: check movability of hugepage in unmap_and_move_huge_page() · 83467efb
      Naoya Horiguchi 提交于
      Currently hugepage migration works well only for pmd-based hugepages
      (mainly due to lack of testing,) so we had better not enable migration of
      other levels of hugepages until we are ready for it.
      
      Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
      page table walk and check pud/pmd_huge() there, so they are safe.  But the
      other users (softoffline and memory hotremove) don't do this, so without
      this patch they can try to migrate unexpected types of hugepages.
      
      To prevent this, we introduce hugepage_migration_support() as an
      architecture dependent check of whether hugepage are implemented on a pmd
      basis or not.  And on some architecture multiple sizes of hugepages are
      available, so hugepage_migration_support() also checks hugepage size.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83467efb
    • N
      mm: memory-hotplug: enable memory hotplug to handle hugepage · c8721bbb
      Naoya Horiguchi 提交于
      Until now we can't offline memory blocks which contain hugepages because a
      hugepage is considered as an unmovable page.  But now with this patch
      series, a hugepage has become movable, so by using hugepage migration we
      can offline such memory blocks.
      
      What's different from other users of hugepage migration is that we need to
      decompose all the hugepages inside the target memory block into free buddy
      pages after hugepage migration, because otherwise free hugepages remaining
      in the memory block intervene the memory offlining.  For this reason we
      introduce new functions dissolve_free_huge_page() and
      dissolve_free_huge_pages().
      
      Other than that, what this patch does is straightforwardly to add hugepage
      migration code, that is, adding hugepage code to the functions which scan
      over pfn and collect hugepages to be migrated, and adding a hugepage
      allocation function to alloc_migrate_target().
      
      As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
      over them because it's larger than memory block.  So we now simply leave
      it to fail as it is.
      
      [yongjun_wei@trendmicro.com.cn: remove duplicated include]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8721bbb
    • N
      mm: migrate: remove VM_HUGETLB from vma flag check in vma_migratable() · 71ea2efb
      Naoya Horiguchi 提交于
      Enable hugepage migration from migrate_pages(2), move_pages(2), and
      mbind(2).
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71ea2efb
    • N
      mm: mbind: add hugepage migration code to mbind() · 74060e4d
      Naoya Horiguchi 提交于
      Extend do_mbind() to handle vma with VM_HUGETLB set.  We will be able to
      migrate hugepage with mbind(2) after applying the enablement patch which
      comes later in this series.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74060e4d
    • N
      mm: soft-offline: use migrate_pages() instead of migrate_huge_page() · b8ec1cee
      Naoya Horiguchi 提交于
      Currently migrate_huge_page() takes a pointer to a hugepage to be migrated
      as an argument, instead of taking a pointer to the list of hugepages to be
      migrated.  This behavior was introduced in commit 189ebff2 ("hugetlb:
      simplify migrate_huge_page()"), and was OK because until now hugepage
      migration is enabled only for soft-offlining which migrates only one
      hugepage in a single call.
      
      But the situation will change in the later patches in this series which
      enable other users of page migration to support hugepage migration.  They
      can kick migration for both of normal pages and hugepages in a single
      call, so we need to go back to original implementation which uses linked
      lists to collect the hugepages to be migrated.
      
      With this patch, soft_offline_huge_page() switches to use migrate_pages(),
      and migrate_huge_page() is not used any more.  So let's remove it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8ec1cee
    • N
      mm: migrate: make core migration code aware of hugepage · 31caf665
      Naoya Horiguchi 提交于
      Currently hugepage migration is available only for soft offlining, but
      it's also useful for some other users of page migration (clearly because
      users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
      So this patchset tries to extend such users to support hugepage migration.
      
      The target of this patchset is to enable hugepage migration for NUMA
      related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
      memory hotplug.
      
      This patchset does not add hugepage migration for memory compaction,
      because users of memory compaction mainly expect to construct thp by
      arranging raw pages, and there's little or no need to compact hugepages.
      CMA, another user of page migration, can have benefit from hugepage
      migration, but is not enabled to support it for now (just because of lack
      of testing and expertise in CMA.)
      
      Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
      x86_64, or hugepages in architectures like ia64) is not enabled for now
      (again, because of lack of testing.)
      
      As for how these are achived, I extended the API (migrate_pages()) to
      handle hugepage (with patch 1 and 2) and adjusted code of each caller to
      check and collect movable hugepages (with patch 3-7).  Remaining 2 patches
      are kind of miscellaneous ones to avoid unexpected behavior.  Patch 8 is
      about making sure that we only migrate pmd-based hugepages.  And patch 9
      is about choosing appropriate zone for hugepage allocation.
      
      My test is mainly functional one, simply kicking hugepage migration via
      each entry point and confirm that migration is done correctly.  Test code
      is available here:
      
        git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git
      
      And I always run libhugetlbfs test when changing hugetlbfs's code.  With
      this patchset, no regression was found in the test.
      
      This patch (of 9):
      
      Before enabling each user of page migration to support hugepage,
      this patch enables the list of pages for migration to link not only
      LRU pages, but also hugepages. As a result, putback_movable_pages()
      and migrate_pages() can handle both of LRU pages and hugepages.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31caf665
    • J
      lib/genalloc.c: fix overflow of ending address of memory chunk · 674470d9
      Joonyoung Shim 提交于
      In struct gen_pool_chunk, end_addr means the end address of memory chunk
      (inclusive), but in the implementation it is treated as address + size of
      memory chunk (exclusive), so it points to the address plus one instead of
      correct ending address.
      
      The ending address of memory chunk plus one will cause overflow on the
      memory chunk including the last address of memory map, e.g.  when starting
      address is 0xFFF00000 and size is 0x100000 on 32bit machine, ending
      address will be 0x100000000.
      
      Use correct ending address like starting address + size - 1.
      
      [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr]
      Signed-off-by: NJoonyoung Shim <jy0922.shim@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      674470d9
    • C
      vmstat: create separate function to fold per cpu diffs into local counters · 2bb921e5
      Christoph Lameter 提交于
      The main idea behind this patchset is to reduce the vmstat update overhead
      by avoiding interrupt enable/disable and the use of per cpu atomics.
      
      This patch (of 3):
      
      It is better to have a separate folding function because
      refresh_cpu_vm_stats() also does other things like expire pages in the
      page allocator caches.
      
      If we have a separate function then refresh_cpu_vm_stats() is only called
      from the local cpu which allows additional optimizations.
      
      The folding function is only called when a cpu is being downed and
      therefore no other processor will be accessing the counters.  Also
      simplifies synchronization.
      
      [akpm@linux-foundation.org: fix UP build]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2bb921e5
    • J
      swap: clean-up #ifdef in page_mapping() · d2cf5ad6
      Joonsoo Kim 提交于
      PageSwapCache() is always false when !CONFIG_SWAP, so compiler
      properly discard related code. Therefore, we don't need #ifdef explicitly.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2cf5ad6
    • J
      mm: page_alloc: fair zone allocator policy · 81c0a2bb
      Johannes Weiner 提交于
      Each zone that holds userspace pages of one workload must be aged at a
      speed proportional to the zone size.  Otherwise, the time an individual
      page gets to stay in memory depends on the zone it happened to be
      allocated in.  Asymmetry in the zone aging creates rather unpredictable
      aging behavior and results in the wrong pages being reclaimed, activated
      etc.
      
      But exactly this happens right now because of the way the page allocator
      and kswapd interact.  The page allocator uses per-node lists of all zones
      in the system, ordered by preference, when allocating a new page.  When
      the first iteration does not yield any results, kswapd is woken up and the
      allocator retries.  Due to the way kswapd reclaims zones below the high
      watermark while a zone can be allocated from when it is above the low
      watermark, the allocator may keep kswapd running while kswapd reclaim
      ensures that the page allocator can keep allocating from the first zone in
      the zonelist for extended periods of time.  Meanwhile the other zones
      rarely see new allocations and thus get aged much slower in comparison.
      
      The result is that the occasional page placed in lower zones gets
      relatively more time in memory, even gets promoted to the active list
      after its peers have long been evicted.  Meanwhile, the bulk of the
      working set may be thrashing on the preferred zone even though there may
      be significant amounts of memory available in the lower zones.
      
      Even the most basic test -- repeatedly reading a file slightly bigger than
      memory -- shows how broken the zone aging is.  In this scenario, no single
      page should be able stay in memory long enough to get referenced twice and
      activated, but activation happens in spades:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 0
            nr_active_file 8
            nr_inactive_file 1582
            nr_active_file 11994
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 70
            nr_inactive_file 258753
            nr_active_file 443214
            nr_inactive_file 149793
            nr_active_file 12021
      
      Fix this with a very simple round robin allocator.  Each zone is allowed a
      batch of allocations that is proportional to the zone's size, after which
      it is treated as full.  The batch counters are reset when all zones have
      been tried and the allocator enters the slowpath and kicks off kswapd
      reclaim.  Allocation and reclaim is now fairly spread out to all
      available/allowable zones:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 174
            nr_active_file 4865
            nr_inactive_file 53
            nr_active_file 860
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 666622
            nr_active_file 4988
            nr_inactive_file 190969
            nr_active_file 937
      
      When zone_reclaim_mode is enabled, allocations will now spread out to all
      zones on the local node, not just the first preferred zone (which on a 4G
      node might be a tiny Normal zone).
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Tested-by: NKevin Hilman <khilman@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81c0a2bb
    • S
      swap: make cluster allocation per-cpu · ebc2a1a6
      Shaohua Li 提交于
      swap cluster allocation is to get better request merge to improve
      performance.  But the cluster is shared globally, if multiple tasks are
      doing swap, this will cause interleave disk access.  While multiple tasks
      swap is quite common, for example, each numa node has a kswapd thread
      doing swap and multiple threads/processes doing direct page reclaim.
      
      ioscheduler can't help too much here, because tasks don't send swapout IO
      down to block layer in the meantime.  Block layer does merge some IOs, but
      a lot not, depending on how many tasks are doing swapout concurrently.  In
      practice, I've seen a lot of small size IO in swapout workloads.
      
      We makes the cluster allocation per-cpu here.  The interleave disk access
      issue goes away.  All tasks swapout to their own cluster, so swapout will
      become sequential, which can be easily merged to big size IO.  If one CPU
      can't get its per-cpu cluster (for example, there is no free cluster
      anymore in the swap), it will fallback to scan swap_map.  The CPU can
      still continue swap.  We don't need recycle free swap entries of other
      CPUs.
      
      In my test (swap to a 2-disk raid0 partition), this improves around 10%
      swapout throughput, and request size is increased significantly.
      
      How does this impact swap readahead is uncertain though.  On one side,
      page reclaim always isolates and swaps several adjancent pages, this will
      make page reclaim write the pages sequentially and benefit readahead.  On
      the other side, several CPU write pages interleave means the pages don't
      live _sequentially_ but relatively _near_.  In the per-cpu allocation
      case, if adjancent pages are written by different cpus, they will live
      relatively _far_.  So how this impacts swap readahead depends on how many
      pages page reclaim isolates and swaps one time.  If the number is big,
      this patch will benefit swap readahead.  Of course, this is about
      sequential access pattern.  The patch has no impact for random access
      pattern, because the new cluster allocation algorithm is just for SSD.
      
      Alternative solution is organizing swap layout to be per-mm instead of
      this per-cpu approach.  In the per-mm layout, we allocate a disk range for
      each mm, so pages of one mm live in swap disk adjacently.  per-mm layout
      has potential issues of lock contention if multiple reclaimers are swap
      pages from one mm.  For a sequential workload, per-mm layout is better to
      implement swap readahead, because pages from the mm are adjacent in disk.
      But per-cpu layout isn't very bad in this workload, as page reclaim always
      isolates and swaps several pages one time, such pages will still live in
      disk sequentially and readahead can utilize this.  For a random workload,
      per-mm layout isn't beneficial of request merge, because it's quite
      possible pages from different mm are swapout in the meantime and IO can't
      be merged in per-mm layout.  while with per-cpu layout we can merge
      requests from any mm.  Considering random workload is more popular in
      workloads with swap (and per-cpu approach isn't too bad for sequential
      workload too), I'm choosing per-cpu layout.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebc2a1a6
    • S
      swap: make swap discard async · 815c2c54
      Shaohua Li 提交于
      swap can do cluster discard for SSD, which is good, but there are some
      problems here:
      
      1. swap do the discard just before page reclaim gets a swap entry and
         writes the disk sectors.  This is useless for high end SSD, because an
         overwrite to a sector implies a discard to original sector too.  A
         discard + overwrite == overwrite.
      
      2. the purpose of doing discard is to improve SSD firmware garbage
         collection.  Idealy we should send discard as early as possible, so
         firmware can do something smart.  Sending discard just after swap entry
         is freed is considered early compared to sending discard before write.
         Of course, if workload is already bound to gc speed, sending discard
         earlier or later doesn't make
      
      3. block discard is a sync API, which will delay scan_swap_map()
         significantly.
      
      4. Write and discard command can be executed parallel in PCIe SSD.
         Making swap discard async can make execution more efficiently.
      
      This patch makes swap discard async and moves discard to where swap entry
      is freed.  Discard and write have no dependence now, so above issues can
      be avoided.  Idealy we should do discard for any freed sectors, but some
      SSD discard is very slow.  This patch still does discard for a whole
      cluster.
      
      My test does a several round of 'mmap, write, unmap', which will trigger a
      lot of swap discard.  In a fusionio card, with this patch, the test
      runtime is reduced to 18% of the time without it, so around 5.5x faster.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      815c2c54
    • S
      swap: change block allocation algorithm for SSD · 2a8f9449
      Shaohua Li 提交于
      I'm using a fast SSD to do swap.  scan_swap_map() sometimes uses up to
      20~30% CPU time (when cluster is hard to find, the CPU time can be up to
      80%), which becomes a bottleneck.  scan_swap_map() scans a byte array to
      search a 256 page cluster, which is very slow.
      
      Here I introduced a simple algorithm to search cluster.  Since we only
      care about 256 pages cluster, we can just use a counter to track if a
      cluster is free.  Every 256 pages use one int to store the counter.  If
      the counter of a cluster is 0, the cluster is free.  All free clusters
      will be added to a list, so searching cluster is very efficient.  With
      this, scap_swap_map() overhead disappears.
      
      This might help low end SD card swap too.  Because if the cluster is
      aligned, SD firmware can do flash erase more efficiently.
      
      We only enable the algorithm for SSD.  Hard disk swap isn't fast enough
      and has downside with the algorithm which might introduce regression (see
      below).
      
      The patch slightly changes which cluster is choosen.  It always adds free
      cluster to list tail.  This can help wear leveling for low end SSD too.
      And if no cluster found, the scan_swap_map() will do search from the end
      of last cluster.  So if no cluster found, the scan_swap_map() will do
      search from the end of last free cluster, which is random.  For SSD, this
      isn't a problem at all.
      
      Another downside is the cluster must be aligned to 256 pages, which will
      reduce the chance to find a cluster.  I would expect this isn't a big
      problem for SSD because of the non-seek penality.  (And this is the reason
      I only enable the algorithm for SSD).
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a8f9449
    • D
      mm: vmstats: track TLB flush stats on UP too · 6df46865
      Dave Hansen 提交于
      The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
      counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
      for SMP.
      
      UP systems do not do remote TLB flushes, so compile those counters out on
      UP.
      
      arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly.  This is
      probably an optimization since both the mtrr code and __flush_tlb() write
      cr4.  It would probably be safe to make that a flush_tlb_all() (and then
      get these statistics), but the mtrr code is ancient and I'm hesitant to
      touch it other than to just stick in the counters.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6df46865