1. 24 Feb, 2013 10 commits
    • mm: vmscan: compaction works against zones, not lruvecs · 9b4f98cd
      Johannes Weiner committed
      The restart logic for when reclaim operates back to back with compaction
      is currently applied on the lruvec level.  But this does not make sense,
      because the container of interest for compaction is a zone as a whole,
      not the zone pages that are part of a certain memory cgroup.
      
      Negative impact is bounded.  For one, the code checks that the lruvec
      has enough reclaim candidates, so it does not risk getting stuck on a
      condition that can not be fulfilled.  And the unfairness of hammering on
      one particular memory cgroup to make progress in a zone will be
      amortized by the round robin manner in which reclaim goes through the
      memory cgroups.  Still, this can lead to unnecessary allocation
      latencies when the code elects to restart on a hard to reclaim or small
      group when there are other, more reclaimable groups in the zone.
      
      Move this logic to the zone level and restart reclaim for all memory
      cgroups in a zone when compaction requires more free pages from it.
      
      [akpm@linux-foundation.org: no need for min_t]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: clean up get_scan_count() · 9a265114
      Johannes Weiner committed
      Reclaim pressure balance between anon and file pages is calculated
      through a tuple of numerators and a shared denominator.
      
      Exceptional cases that want to force-scan anon or file pages configure
      the numerators and denominator such that one list is preferred, which is
      not necessarily the most obvious way:
      
          fraction[0] = 1;
          fraction[1] = 0;
          denominator = 1;
          goto out;
      
      Make this easier by making the force-scan cases explicit and use the
      fractionals only in case they are calculated from reclaim history.
      
      [akpm@linux-foundation.org: avoid using uninitialized_var()]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
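The cleanup described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the kernel's code: the enum values, the function `scan_target()` and its parameters are names chosen for this example. The point is that force-scan cases become explicit modes, and the numerator/denominator pair is consulted only in the proportional case.

```c
#include <assert.h>

/* Illustrative mode names; treat them as assumptions, not kernel API. */
enum scan_mode { SCAN_EQUAL, SCAN_FRACT, SCAN_ANON, SCAN_FILE };

/*
 * Return how many pages to scan from one LRU list.
 * is_file:     nonzero if the list holds file pages, zero for anon
 * size:        pages on the list, assumed already scaled by priority
 * fraction:    numerators for anon ([0]) and file ([1])
 * denominator: shared denominator, used only for SCAN_FRACT
 */
static unsigned long scan_target(enum scan_mode mode, int is_file,
                                 unsigned long size,
                                 const unsigned long fraction[2],
                                 unsigned long denominator)
{
    switch (mode) {
    case SCAN_EQUAL:            /* scan every list in full */
        return size;
    case SCAN_FRACT:            /* balance by reclaim history */
        assert(denominator != 0);
        return size * fraction[is_file ? 1 : 0] / denominator;
    case SCAN_ANON:             /* force-scan anon, skip file */
        return is_file ? 0 : size;
    case SCAN_FILE:             /* force-scan file, skip anon */
        return is_file ? size : 0;
    }
    return 0;
}
```

With this shape, the old idiom of setting the fractions to 1 and 0 with a denominator of 1 and an explicit anon-only mode give the same scan targets, but the mode states the intent directly.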
    • mm: vmscan: improve comment on low-page cache handling · 11d16c25
      Johannes Weiner committed
      Fix comment style and elaborate on why anonymous memory is force-scanned
      when file cache runs low.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: clarify how swappiness, highest priority, memcg interact · 10316b31
      Johannes Weiner committed
      A swappiness of 0 has a slightly different meaning for global reclaim
      (may swap if file cache really low) and memory cgroup reclaim (never
      swap, ever).
      
      In addition, global reclaim at highest priority will scan all LRU lists
      equal to their size and ignore other balancing heuristics.  UNLESS
      swappiness forbids swapping, then the lists are balanced based on recent
      reclaim effectiveness.  UNLESS file cache is running low, then anonymous
      pages are force-scanned.
      
      This (total mess of a) behaviour is implicit and not obvious from the
      way the code is organized.  At least make it apparent in the code flow
      and document the conditions.  It will be easier to come up with sane
      semantics later.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Satoru Moriya <satoru.moriya@hds.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
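One possible reading of the conditions listed in this message, written as an explicit decision function. This is only a sketch: the enum, the function `choose_scan()` and its boolean parameters are hypothetical, and the ordering of the checks is an interpretation of the commit message rather than a copy of the kernel's code flow.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome names for this example. */
enum scan_choice {
    CHOOSE_FILE_ONLY,   /* never touch anonymous pages */
    CHOOSE_EQUAL,       /* scan all lists in proportion to their size */
    CHOOSE_ANON_ONLY,   /* force-scan anonymous pages */
    CHOOSE_BALANCED     /* balance by recent reclaim effectiveness */
};

static enum scan_choice choose_scan(bool global_reclaim, int swappiness,
                                    bool highest_priority,
                                    bool file_cache_low)
{
    assert(swappiness >= 0);

    /* memcg reclaim with swappiness 0: never swap, ever */
    if (!global_reclaim && swappiness == 0)
        return CHOOSE_FILE_ONLY;

    /* global reclaim at highest priority, unless swappiness forbids it */
    if (global_reclaim && highest_priority && swappiness != 0)
        return CHOOSE_EQUAL;

    /* global reclaim with file cache running low */
    if (global_reclaim && file_cache_low)
        return CHOOSE_ANON_ONLY;

    return CHOOSE_BALANCED;
}
```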
    • mm: vmscan: save work scanning (almost) empty LRU lists · d778df51
      Johannes Weiner committed
      In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
      amount of pages is scanned from the LRU lists on each iteration, to make
      progress.
      
      Do not make this minimum bigger than the respective LRU list size,
      however, and save some busy work trying to isolate and reclaim pages
      that are not there.
      
      Empty LRU lists are quite common with memory cgroups in NUMA
      environments because there exists a set of LRU lists for each zone for
      each memory cgroup, while the memory of a single cgroup is expected to
      stay on just one node.  The number of expected empty LRU lists is thus
      
        memcgs * (nodes - 1) * lru types
      
      Each attempt to reclaim from an empty LRU list does expensive size
      comparisons between lists, acquires the zone's lru lock etc.  Avoid
      that.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
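The two ideas in this message, clamping the forced minimum to the list size and counting the expected empty lists, can be sketched like this. The constant `MIN_SCAN_BATCH` and both function names are assumptions made for the example; the kernel's actual minimum and helper names may differ.

```c
#include <assert.h>

/* Assumed minimum batch size, for illustration only. */
#define MIN_SCAN_BATCH 32UL

/* Never force-scan more pages than the LRU list actually holds. */
static unsigned long forced_scan(unsigned long lru_size)
{
    return lru_size < MIN_SCAN_BATCH ? lru_size : MIN_SCAN_BATCH;
}

/*
 * Expected number of empty LRU lists when each memcg keeps its memory
 * on a single node: memcgs * (nodes - 1) * lru types.
 */
static unsigned long expected_empty_lists(unsigned long memcgs,
                                          unsigned long nodes,
                                          unsigned long lru_types)
{
    assert(nodes >= 1);
    return memcgs * (nodes - 1) * lru_types;
}
```

For instance, 100 memory cgroups on a 4-node machine with 4 LRU types would give 1200 lists that are expected to be empty, each of which would otherwise cost a lock acquisition per reclaim pass.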
    • mm: memcg: only evict file pages when we have plenty · 7c5bd705
      Johannes Weiner committed
      Commit e9868505 ("mm, vmscan: only evict file pages when we have
      plenty") makes a point of not going for anonymous memory while there is
      still enough inactive cache around.
      
      The check was added only for global reclaim, but it is just as useful to
      reduce swapping in memory cgroup reclaim:
      
          200M-memcg-defconfig-j2
      
                                           vanilla                   patched
          Real time              454.06 (  +0.00%)         453.71 (  -0.08%)
          User time              668.57 (  +0.00%)         668.73 (  +0.02%)
          System time            128.92 (  +0.00%)         129.53 (  +0.46%)
          Swap in               1246.80 (  +0.00%)         814.40 ( -34.65%)
          Swap out              1198.90 (  +0.00%)         827.00 ( -30.99%)
          Pages allocated   16431288.10 (  +0.00%)    16434035.30 (  +0.02%)
          Major faults           681.50 (  +0.00%)         593.70 ( -12.86%)
          THP faults             237.20 (  +0.00%)         242.40 (  +2.18%)
          THP collapse           241.20 (  +0.00%)         248.50 (  +3.01%)
          THP splits             157.30 (  +0.00%)         161.40 (  +2.59%)
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • CMA: make putback_lru_pages() call conditional · 2a6f5124
      Srinivas Pandruvada committed
      As per documentation and other places calling putback_lru_pages(),
      putback_lru_pages() is called on error only.  Make the CMA code behave
      consistently.
      
      [akpm@linux-foundation.org: remove a test-n-branch in the wrapup code]
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: convert to pr_foo() · ffb22af5
      Andrew Morton committed
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: convert printk(KERN_FOO) to pr_foo() · d045197f
      Andrew Morton committed
      Acked-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, oom: provide more precise dump info while memcg oom happening · 58cf188e
      Sha Zhengju committed
      Currently, when a memcg OOM happens, the OOM dump messages still show
      global state and provide little useful information for users.  This
      patch prints more targeted memcg page statistics for memcg OOM and
      takes the hierarchy into consideration.
      
      Based on Michal's advice, the whole hierarchy is covered: suppose
      we trigger an OOM on A's limit
      
              root_memcg
                  |
                  A (use_hierarchy=1)
                 / \
                B   C
                |
                D
      then the printed info will be:
      
        Memory cgroup stats for /A:...
        Memory cgroup stats for /A/B:...
        Memory cgroup stats for /A/C:...
        Memory cgroup stats for /A/B/D:...
      
      Following are samples of oom output:
      
      (1) Before change:
      
          mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fbfb>] dump_header+0x83/0x1ca
           ..... (call trace)
           [<ffffffff8168a818>] page_fault+0x28/0x30
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          Task in /A/B/D killed as a result of limit of /A
          memory: usage 101376kB, limit 101376kB, failcnt 57
          memory+swap: usage 101376kB, limit 101376kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
                                   <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
          Mem-Info:
          Node 0 DMA per-cpu:
          CPU    0: hi:    0, btch:   1 usd:   0
          ......
          CPU    3: hi:    0, btch:   1 usd:   0
          Node 0 DMA32 per-cpu:
          CPU    0: hi:  186, btch:  31 usd: 173
          ......
          CPU    3: hi:  186, btch:  31 usd: 130
                                   <<<<<<<<<<<<<<<<<<<<< print global page state
          active_anon:92963 inactive_anon:40777 isolated_anon:0
           active_file:33027 inactive_file:51718 isolated_file:0
           unevictable:0 dirty:3 writeback:0 unstable:0
           free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
           mapped:20278 shmem:35971 pagetables:5885 bounce:0
           free_cma:0
                                   <<<<<<<<<<<<<<<<<<<<< print per zone page state
          Node 0 DMA free:15836kB ... all_unreclaimable? no
          lowmem_reserve[]: 0 3175 3899 3899
          Node 0 DMA32 free:2888564kB ... all_unreclaimable? no
          lowmem_reserve[]: 0 0 724 724
          lowmem_reserve[]: 0 0 0 0
          Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
          Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
          120710 total pagecache pages
          0 pages in swap cache
                                   <<<<<<<<<<<<<<<<<<<<< print global swap cache stat
          Swap cache stats: add 0, delete 0, find 0/0
          Free swap  = 499708kB
          Total swap = 499708kB
          1040368 pages RAM
          58678 pages reserved
          169065 pages shared
          173632 pages non-shared
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2693]     0  2693     6005     1324      17        0             0 god
          [ 2754]     0  2754     6003     1320      16        0             0 god
          [ 2811]     0  2811     5992     1304      18        0             0 god
          [ 2874]     0  2874     6005     1323      18        0             0 god
          [ 2935]     0  2935     8720     7742      21        0             0 mal-30
          [ 2976]     0  2976    21520    17577      42        0             0 mal-80
          Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
          Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB
      
      We can see that the messages dumped by show_free_areas() are lengthy and
      provide very little information about the memcg that just hit OOM.
      
      (2) After change
          mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fd0b>] dump_header+0x83/0x1d1
           .......(call trace)
           [<ffffffff8168a918>] page_fault+0x28/0x30
          Task in /A/B/D killed as a result of limit of /A
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          memory: usage 102400kB, limit 102400kB, failcnt 140
          memory+swap: usage 102400kB, limit 102400kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
          Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2260]     0  2260     6006     1325      18        0             0 god
          [ 2383]     0  2383     6003     1319      17        0             0 god
          [ 2503]     0  2503     6004     1321      18        0             0 god
          [ 2622]     0  2622     6004     1321      16        0             0 god
          [ 2695]     0  2695     8720     7741      22        0             0 mal-30
          [ 2704]     0  2704    21520    17839      43        0             0 mal-80
          Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
          Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB
      
      This version provides more targeted memcg information in the "Memory
      cgroup stats for XXX" section.
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 22 Feb, 2013 3 commits
    • block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Darrick J. Wong committed
      This provides a band-aid to provide stable page writes on jbd without
      needing to backport the fixed locking and page writeback bit handling
      schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
      page contents instead of waiting.
      
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize their page snapshotting thing to one place.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong committed
      Create a helper function that checks whether a backing device requires
      stable page writes and, if so, performs the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
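The conditional-wait helper described in this commit can be illustrated with a small user-space model. Everything here is a hypothetical stand-in: `struct backing_dev`, `struct page_state`, `wait_for_writeback()` and `wait_if_stable_required()` are names invented for this sketch, and the "wait" is simulated by a counter rather than by blocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's structures. */
struct backing_dev {
    bool needs_stable_pages;   /* device checksums or otherwise reads twice */
};

struct page_state {
    const struct backing_dev *bdi;
    bool under_writeback;
    unsigned int waits;        /* counts simulated waits, example only */
};

/* Stand-in for blocking until writeback of the page has finished. */
static void wait_for_writeback(struct page_state *page)
{
    if (page->under_writeback) {
        page->waits++;
        page->under_writeback = false;
    }
}

/* Wait only if the backing device declared that it needs stable pages. */
static void wait_if_stable_required(struct page_state *page)
{
    assert(page && page->bdi);
    if (!page->bdi->needs_stable_pages)
        return;                /* page may be redirtied during writeout */
    wait_for_writeback(page);
}
```

Callers that make a page writable would go through the one helper, so a device that does not need stable pages never pays the wait.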
    • bdi: allow block devices to say that they require stable page writes · 7d311cda
      Darrick J. Wong committed
      This patchset ("stable page writes, part 2") makes some key
      modifications to the original 'stable page writes' patchset.  First, it
      provides creators (devices and filesystems) of a backing_dev_info a flag
      that declares whether or not it is necessary to ensure that page
      contents cannot change during writeout.  It is no longer assumed that
      this is true of all devices (which was never true anyway).  Second, the
      flag is used to relax the wait_on_page_writeback calls so that the wait
      only occurs if the device needs it.  Third, it fixes up the remaining
      disk-backed filesystems to use this improved conditional-wait logic to
      provide stable page writes on those filesystems.
      
      It is hoped that (for people not using checksumming devices, anyway)
      this patchset will win back the performance needlessly lost since the
      original stable page write patchset went into 3.0.  Sorry about not
      fixing it sooner.
      
      Complaints were registered by several people about the long write
      latencies introduced by the original stable page write patchset.
      Generally speaking, the kernel ought to allocate as little extra memory
      as possible to facilitate writeout, but for people who simply cannot
      wait, a second page stability strategy is (re)introduced: snapshotting
      page contents.  The waiting behavior is still the default strategy; to
      enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
      set.  This flag is used to bandaid^Henable stable page writeback on
      ext3[1], and is not used anywhere else.
      
      Given that there are already a few storage devices and network FSes that
      have rolled their own page stability wait/page snapshot code, it would
      be nice to move towards consolidating all of these.  It seems possible
      that iscsi and raid5 may wish to use the new stable page write support
      to enable zero-copy writeout.
      
      Thank you to Jan Kara for helping fix a couple more filesystems.
      
      Per Andrew Morton's request, here are the results of using dbench to
      measure latencies on ext2:
      
      3.8.0-rc3:
         Operation      Count    AvgLat    MaxLat
         ----------------------------------------
         WriteX        109347     0.028    59.817
         ReadX         347180     0.004     3.391
         Flush          15514    29.828   287.283
      
        Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
         WriteX        105556     0.029     4.273
         ReadX         335004     0.005     4.112
         Flush          14982    30.540   298.634
      
        Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, for ext2 the maximum write latency decreases from ~60ms
      on a laptop hard disk to ~4ms.  I'm not sure why the flush latencies
      increase, though I suspect that being able to dirty pages faster gives
      the flusher more work to do.
      
      On ext4, the average write latency decreases as well as all the maximum
      latencies:
      
      3.8.0-rc3:
         WriteX         85624     0.152    33.078
         ReadX         272090     0.010    61.210
         Flush          12129    36.219   168.260
      
        Throughput 44.8618 MB/sec  4 clients  4 procs  max_latency=168.276 ms
      
      3.8.0-rc3 + patches:
         WriteX         86082     0.141    30.928
         ReadX         273358     0.010    36.124
         Flush          12214    34.800   165.689
      
        Throughput 44.9941 MB/sec  4 clients  4 procs  max_latency=165.722 ms
      
      XFS seems to exhibit similar latency improvements as ext2:
      
      3.8.0-rc3:
         WriteX        125739     0.028   104.343
         ReadX         399070     0.005     4.115
         Flush          17851    25.004   131.390
      
        Throughput 66.0024 MB/sec  4 clients  4 procs  max_latency=131.406 ms
      
      3.8.0-rc3 + patches:
         WriteX        123529     0.028     6.299
         ReadX         392434     0.005     4.287
         Flush          17549    25.120   188.687
      
        Throughput 64.9113 MB/sec  4 clients  4 procs  max_latency=188.704 ms
      
      ...and btrfs, just to round things out, also shows some latency
      decreases:
      
      3.8.0-rc3:
         WriteX         67122     0.083    82.355
         ReadX         212719     0.005     2.828
         Flush           9547    47.561   147.418
      
        Throughput 35.3391 MB/sec  4 clients  4 procs  max_latency=147.433 ms
      
      3.8.0-rc3 + patches:
         WriteX         64898     0.101    71.631
         ReadX         206673     0.005     7.123
         Flush           9190    47.963   219.034
      
        Throughput 34.0795 MB/sec  4 clients  4 procs  max_latency=219.044 ms
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own wait code, or they don't block at all.  The blocking
      behavior is back to what it was before 3.0 if you don't have a disk
      requiring stable page writes.
      
      This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
      xfs.  I've spot-checked 3.8.0-rc4 and seem to be getting the same
      results as -rc3.
      
      [1] The alternative fixes to ext3 include fixing the locking order and
      page bit handling like we did for ext4 (but then why not just use
      ext4?), or setting PG_writeback so early that ext3 becomes extremely
      slow.  I tried that, but the number of write()s I could initiate dropped
      by nearly an order of magnitude.  That was a bit much even for the
      author of the stable page series! :)
      
      This patch:
      
      Creates a per-backing-device flag that tracks whether or not pages must
      be held immutable during writeout.  Eventually it will be used to waive
      wait_on_page_writeback() if nothing requires stable pages.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 19 Feb, 2013 1 commit
    • mm: fix pageblock bitmap allocation · 7c45512d
      Linus Torvalds committed
      Commit c060f943 ("mm: use aligned zone start for pfn_to_bitidx
      calculation") fixed our calculation of the index into the pageblock
      bitmap when a !SPARSEMEM zone was not aligned to pageblock_nr_pages.
      
      However, the _allocation_ of that bitmap had never taken this alignment
      requirement into account, so depending on the exact size and alignment of
      the zone, the use of that index could then access past the allocation,
      resulting in some very subtle memory corruption.
      
      This was reported (and bisected) by Ingo Molnar: one of his random
      config builds would hang with certain very specific kernel command line
      options.
      
      In the meantime, commit c060f943 has been marked for stable, so this
      fix needs to be back-ported to the stable kernels that backported the
      commit to use the right alignment.
      Bisected-and-tested-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
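To see why the allocation size has to account for the zone's start alignment, consider this sketch of a bitmap size calculation. The constants and the names `bitmap_bytes()` and `round_up_to()` are assumptions for illustration; the real pageblock order and bits per pageblock depend on the kernel configuration.

```c
#include <assert.h>

/* Assumed values for illustration; real ones are configuration-dependent. */
#define PAGEBLOCK_ORDER    9UL
#define PAGEBLOCK_NR_PAGES (1UL << PAGEBLOCK_ORDER)
#define BITS_PER_PAGEBLOCK 4UL

static unsigned long round_up_to(unsigned long x, unsigned long step)
{
    return (x + step - 1) / step * step;
}

/* Bytes needed for the bitmap of a zone that may start mid-pageblock. */
static unsigned long bitmap_bytes(unsigned long zone_start_pfn,
                                  unsigned long zone_pages)
{
    unsigned long bits;

    /* Include the pages between the aligned-down start and the real start. */
    zone_pages += zone_start_pfn & (PAGEBLOCK_NR_PAGES - 1);

    /* One group of bits per (partially) covered pageblock. */
    bits = round_up_to(zone_pages, PAGEBLOCK_NR_PAGES) >> PAGEBLOCK_ORDER;
    bits *= BITS_PER_PAGEBLOCK;

    /* Round up to whole machine words, then convert to bytes. */
    bits = round_up_to(bits, 8 * sizeof(unsigned long));
    return bits / 8;
}
```

A zone of 8192 pages that starts 256 pages into a pageblock straddles 17 pageblocks instead of 16. An allocation sized from the page count alone would be one pageblock short, which is the kind of overrun the message describes.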
  4. 14 Feb, 2013 1 commit
    • s390/mm: implement software dirty bits · abf09bed
      Martin Schwidefsky committed
      The s390 architecture is unique with respect to dirty page detection:
      it uses the change bit in the per-page storage key to track page
      modifications. All other architectures track dirty bits by means
      of page table entries. This property of s390 has caused numerous
      problems in the past, e.g. see git commit ef5d437f
      "mm: fix XFS oops due to dirty pages without buffers on s390".
      
      To avoid future issues in regard to per-page dirty bits convert
      s390 to a fault based software dirty bit detection mechanism. All
      user page table entries which are marked as clean will be hardware
      read-only, even if the pte is supposed to be writable. A write by
      the user process will trigger a protection fault which will cause
      the user pte to be marked as dirty and the hardware read-only bit
      is removed.
      
      With this change the dirty bit in the storage key is irrelevant
      for Linux as a host, but the storage key is still required for
      KVM guests. The effect is that page_test_and_clear_dirty and the
      related code can be removed. The referenced bit in the storage
      key is still used by the page_test_and_clear_young primitive to
      provide page age information.
      
      For page cache pages of mappings with mapping_cap_account_dirty
      there will not be any change in behavior as the dirty bit tracking
      already uses read-only ptes to control the amount of dirty pages.
      Only for swap cache pages and pages of mappings without
      mapping_cap_account_dirty there can be additional protection faults.
      To avoid an excessive number of additional faults, the mk_pte
      primitive checks for PageDirty if the pgprot value allows for writes
      and pre-dirties the pte. That avoids all additional faults for
      tmpfs and shmem pages until these pages are added to the swap cache.
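
      The fault-based scheme described above can be modelled in a few lines
      of plain C. This is only a sketch: the flag names and helper functions
      are hypothetical and do not correspond to the real s390 pte layout or
      kernel API. It shows a clean writable pte starting out hardware
      read-only, the protection fault marking it dirty, and the pre-dirtying
      of ptes for pages that are already dirty.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, not the real s390 pte encoding. */
#define PTE_SW_WRITE 0x1u   /* software: the mapping permits writes */
#define PTE_SW_DIRTY 0x2u   /* software: the page has been written  */
#define PTE_HW_RO    0x4u   /* hardware: stores raise a fault       */

/* Build a pte; pre-dirty it when the page is already dirty and writable. */
static unsigned int make_pte(bool writable, bool page_dirty)
{
    unsigned int pte = PTE_HW_RO;   /* clean ptes start hardware read-only */

    if (writable) {
        pte |= PTE_SW_WRITE;
        if (page_dirty) {
            pte |= PTE_SW_DIRTY;
            pte &= ~PTE_HW_RO;      /* no extra fault needed later */
        }
    }
    return pte;
}

/* Would a store through this pte raise a protection fault? */
static bool store_faults(unsigned int pte)
{
    return (pte & PTE_HW_RO) != 0;
}

/*
 * Protection fault on a store. Returns true if the store may be retried,
 * false if the mapping really is read-only.
 */
static bool handle_write_fault(unsigned int *pte)
{
    assert(pte != 0 && store_faults(*pte));
    if (!(*pte & PTE_SW_WRITE))
        return false;
    *pte |= PTE_SW_DIRTY;           /* record the modification */
    *pte &= ~PTE_HW_RO;             /* later stores proceed without faults */
    return true;
}
```

      In this model a writable pte takes at most one fault before it is
      marked dirty, and a pte created for an already dirty page takes none.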
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      abf09bed
  5. 13 Feb, 2013 3 commits
  6. 08 Feb, 2013 2 commits
  7. 05 Feb, 2013 3 commits
  8. 30 Jan, 2013 2 commits
  9. 24 Jan, 2013 1 commit
  10. 18 Jan, 2013 1 commit
  11. 12 Jan, 2013 9 commits
    • M
      mm: compaction: partially revert capture of suitable high-order page · 8fb74b9f
      Mel Gorman committed
      Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
      waiting for POLLIN on a local TCP socket.  It was easier to trigger if
      there was disk IO and dirty pages at the same time and he bisected it to
      commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available").
      
      The intention of that patch was to improve high-order allocations under
      memory pressure after changes made to reclaim in 3.6 drastically hurt
      THP allocations but the approach was flawed.  For Eric, the problem was
      that page->pfmemalloc was not being cleared for captured pages leading
      to a poor interaction with swap-over-NFS support causing the packets to
      be dropped.  However, I identified a few more problems with the patch
      including the fact that it can increase contention on zone->lock in some
      cases which could result in async direct compaction being aborted early.
      
      In retrospect the capture patch took the wrong approach.  What it should
      have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
      was allocating for THP and avoided races that way.  While the patch was
      shown to improve allocation success rates at the time, the benefit is
      marginal given the relative complexity and it should be revisited from
      scratch in the context of the other reclaim-related changes that have
      taken place since the patch was first written and tested.  This patch
      partially reverts commit 1fb3f8ca ("mm: compaction: capture a
      suitable high-order page immediately when it is made available").
      Reported-and-tested-by: Eric Wong <normalperson@yhbt.net>
      Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fb74b9f
    • M
      mm: thp: acquire the anon_vma rwsem for write during split · 062f1af2
      Mel Gorman committed
      Zhouping Liu reported the following against 3.8-rc1 when running a mmap
      testcase from LTP.
      
        mapcount 0 page_mapcount 3
        ------------[ cut here ]------------
        kernel BUG at mm/huge_memory.c:1798!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas
        CPU 1
        Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
        RIP: __split_huge_page+0x677/0x6d0
        RSP: 0000:ffff88017a03fc08  EFLAGS: 00010293
        RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2
        RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246
        RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a
        R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000
        R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000
        FS:  00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0)
        Stack:
         ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88
         ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c
         00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8
        Call Trace:
          split_huge_page+0x68/0xb0
          __split_huge_page_pmd+0x134/0x330
          split_huge_page_pmd_mm+0x51/0x60
          split_huge_page_address+0x3b/0x50
          __vma_adjust_trans_huge+0x9c/0xf0
          vma_adjust+0x684/0x750
          __split_vma.isra.28+0x1fa/0x220
          do_munmap+0xf9/0x420
          vm_munmap+0x4e/0x70
          sys_munmap+0x2b/0x40
          system_call_fastpath+0x16/0x1b
      
      Alexander Beregalov and Alex Xu reported similar bugs and Hillf Danton
      identified that commit 5a505085 ("mm/rmap: Convert the struct
      anon_vma::mutex to an rwsem") and commit 4fc3f1d6 ("mm/rmap,
      migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable")
      were likely the problem.  Reverting these commits was reported to solve
      the problem for Alexander.
      
      Despite the reason for these commits, NUMA balancing is not the direct
      source of the problem.  split_huge_page() expects the anon_vma lock to
      be exclusive to serialise the whole split operation.  Ordinarily it is
      expected that the anon_vma lock would only be required when updating the
      avcs but THP also uses the anon_vma rwsem for collapse and split
      operations where the page lock or compound lock cannot be used (as the
      page is changing from base to THP or vice versa) and the page table
      locks are insufficient.
      
      This patch takes the anon_vma lock for write to serialise against parallel
      split_huge_page as THP expected before the conversion to rwsem.
      Reported-and-tested-by: Zhouping Liu <zliu@redhat.com>
      Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
      Reported-by: Alex Xu <alex_y_xu@yahoo.ca>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      062f1af2
    • J
      mm: mmap: annotate vm_lock_anon_vma locking properly for lockdep · 572043c9
      Jiri Kosina committed
      Commit 5a505085 ("mm/rmap: Convert the struct anon_vma::mutex to an
      rwsem") turned the anon_vma mutex into an rwsem.
      
      However, the properly annotated nested locking in mm_take_all_locks()
      has been converted from
      
      	mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem);
      
      to
      
      	down_write(&anon_vma->root->rwsem);
      
      which is incomplete, and causes the false positive report from lockdep
      below.
      
      Annotate the fact that mmap_sem is used as an outer lock to serialize
      taking of all the anon_vma rwsems at once no matter the order, using the
      down_write_nest_lock() primitive.
      
      This patch fixes this lockdep report:
      
       =============================================
       [ INFO: possible recursive locking detected ]
       3.8.0-rc2-00036-g5f738967 #171 Not tainted
       ---------------------------------------------
       qemu-kvm/2315 is trying to acquire lock:
        (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0
      
       but task is already holding lock:
        (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&anon_vma->rwsem);
         lock(&anon_vma->rwsem);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       4 locks held by qemu-kvm/2315:
        #0:  (&mm->mmap_sem){++++++}, at: do_mmu_notifier_register+0xfc/0x170
        #1:  (mm_all_locks_mutex){+.+...}, at: mm_take_all_locks+0x36/0x1b0
        #2:  (&mapping->i_mmap_mutex){+.+...}, at: mm_take_all_locks+0xc9/0x1b0
        #3:  (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0
      
       stack backtrace:
       Pid: 2315, comm: qemu-kvm Not tainted 3.8.0-rc2-00036-g5f738967 #171
       Call Trace:
         print_deadlock_bug+0xf2/0x100
         validate_chain+0x4f6/0x720
         __lock_acquire+0x359/0x580
         lock_acquire+0x121/0x190
         down_write+0x3f/0x70
         mm_take_all_locks+0x149/0x1b0
         do_mmu_notifier_register+0x68/0x170
         mmu_notifier_register+0xe/0x10
         kvm_create_vm+0x22b/0x330 [kvm]
         kvm_dev_ioctl+0xf8/0x1a0 [kvm]
         do_vfs_ioctl+0x9d/0x350
         sys_ioctl+0x91/0xb0
         system_call_fastpath+0x16/0x1b
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      572043c9
    • M
      mm: bootmem: fix free_all_bootmem_core() with odd bitmap alignment · 10d73e65
      Max Filippov committed
      Currently free_all_bootmem_core ignores that node_min_pfn may not be a
      multiple of BITS_PER_LONG.  E.g. commit 6dccdcbe ("mm: bootmem: fix
      checking the bitmap when finally freeing bootmem") shifts vec by lower
      bits of start instead of lower bits of idx.  Also
      
        if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL)
      
      assumes that vec bit 0 corresponds to start pfn, which is only true when
      node_min_pfn is a multiple of BITS_PER_LONG.  Also, the loop in the else
      clause can double-free pages (e.g.  with node_min_pfn == start == 1,
      map[0] == ~0 on a 32-bit machine, page 32 will be double-freed).
      
      This bug causes the following message during xtensa kernel boot:
      
        bootmem::free_all_bootmem_core nid=0 start=1 end=8000
        BUG: Bad page state in process swapper  pfn:00001
        page:d04bd020 count:0 mapcount:-127 mapping:  (null) index:0x2
        page flags: 0x0()
        Call Trace:
          bad_page+0x8c/0x9c
          free_pages_prepare+0x5e/0x88
          free_hot_cold_page+0xc/0xa0
          __free_pages+0x24/0x38
          __free_pages_bootmem+0x54/0x56
          free_all_bootmem_core$part$11+0xeb/0x138
          free_all_bootmem+0x46/0x58
          mem_init+0x25/0xa4
          start_kernel+0x11e/0x25c
          should_never_return+0x0/0x3be7
      
      The fix is the following:
       - always align vec so that its bit 0 corresponds to start
       - provide BITS_PER_LONG bits in vec, if those bits are available in the
         map
       - don't free pages past next start position in the else clause.
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Prasad Koya <prasad.koya@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10d73e65
    • L
      mm: use aligned zone start for pfn_to_bitidx calculation · c060f943
      Laura Abbott committed
      The current calculation in pfn_to_bitidx assumes that (pfn -
      zone->zone_start_pfn) >> pageblock_order will return the same bit for
      all pfn in a pageblock.  If zone_start_pfn is not aligned to
      pageblock_nr_pages, this may not always be correct.
      
      Consider the following with pageblock order = 10, zone start 2MB:
      
        pfn     | pfn - zone start | (pfn - zone start) >> page block order
        --------------------------------------------------------------------
        0x26000 | 0x25e00          | 0x97
        0x26100 | 0x25f00          | 0x97
        0x26200 | 0x26000          | 0x98
        0x26300 | 0x26100          | 0x98
      
      This means that calling {get,set}_pageblock_migratetype on a single page
      will not set the migratetype for the full block.  Fix this by rounding
      down zone_start_pfn when doing the bitidx calculation.
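
      The arithmetic in the table can be reproduced with a small standalone
      C sketch. It is not the kernel's pfn_to_bitidx: the function names are
      made up for illustration, only the pageblock index is computed (the
      real function goes on to turn it into a bit offset), and the zone
      start of 2MB is taken to be pfn 0x200, as implied by the table.

```c
#include <assert.h>

/* Values from the example above: pageblock order 10. */
#define PAGEBLOCK_ORDER    10UL
#define PAGEBLOCK_NR_PAGES (1UL << PAGEBLOCK_ORDER)

/* Round x down to a multiple of a, where a is a power of two. */
static unsigned long round_down_pow2(unsigned long x, unsigned long a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return x & ~(a - 1);
}

/* Old calculation: subtract the raw, possibly unaligned zone start. */
static unsigned long block_index_unaligned(unsigned long pfn,
                                           unsigned long zone_start_pfn)
{
    return (pfn - zone_start_pfn) >> PAGEBLOCK_ORDER;
}

/* Fixed calculation: round the zone start down to a pageblock boundary. */
static unsigned long block_index_aligned(unsigned long pfn,
                                         unsigned long zone_start_pfn)
{
    unsigned long start = round_down_pow2(zone_start_pfn, PAGEBLOCK_NR_PAGES);

    return (pfn - start) >> PAGEBLOCK_ORDER;
}
```

      With zone_start_pfn = 0x200, block_index_unaligned() gives 0x97 for
      pfn 0x26000 but 0x98 for pfn 0x26200, although both pfns lie in the
      same pageblock. block_index_aligned() returns 0x98 for every pfn from
      0x26000 to 0x263ff.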
      
      For our use case, the effects of this bug were mostly tied to the fact
      that CMA allocations would either take a long time or fail to happen.
      Depending on the driver using CMA, this could result in anything from
      visual glitches to application failures.
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c060f943
    • J
      mm: compaction: fix echo 1 > compact_memory return error issue · 7964c06d
      Jason Liu committed
      When the following command is run from a shell, it returns an error:
      
        sh/$ echo 1 > /proc/sys/vm/compact_memory
        sh/$ sh: write error: Bad address
      
      After strace, I found the following log:
      
        ...
        write(1, "1\n", 2)               = 3
        write(1, "", 4294967295)         = -1 EFAULT (Bad address)
        write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
        ) = 31
      
      This shows that the system returns 3 (COMPACT_COMPLETE) after data is
      written to compact_memory.
      
      The fix is to make the system just return 0 instead of 3 (COMPACT_COMPLETE)
      from sysctl_compaction_handler after compaction_nodes finished.
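
      A toy model of the problem, for illustration only. The function names
      below are hypothetical, and model_write() assumes a simplified write
      path in which a nonzero handler result is handed back to userspace
      unchanged while a zero result yields the byte count. Under that
      assumption, a handler that leaks the compaction status makes write()
      report 3 for a 2-byte buffer.

```c
#include <assert.h>
#include <stddef.h>

#define COMPACT_COMPLETE 3   /* status value quoted in the commit message */

/* Stand-in for the compaction run; only the status value matters here. */
static int run_compaction(void)
{
    return COMPACT_COMPLETE;
}

/* Buggy handler: passes the compaction status on as its own result. */
static int handler_buggy(int write)
{
    if (write)
        return run_compaction();
    return 0;
}

/* Fixed handler: runs compaction but always reports success. */
static int handler_fixed(int write)
{
    if (write)
        (void)run_compaction();
    return 0;
}

/*
 * Simplified model of the write path (an assumption, see above): a
 * nonzero handler result is returned as-is, otherwise the byte count.
 */
static long model_write(int (*handler)(int), size_t count)
{
    int err;

    assert(handler != NULL);
    err = handler(1);
    return err ? (long)err : (long)count;
}
```

      Here model_write(handler_buggy, 2) yields 3, matching the first
      write() in the strace output, while model_write(handler_fixed, 2)
      yields 2.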
      Signed-off-by: Jason Liu <r64343@freescale.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7964c06d
    • L
      mm: memblock: fix wrong memmove size in memblock_merge_regions() · c0232ae8
      Lin Feng committed
      The memmove span covers from (next+1) to the end of the array, and the
      index of next is (i+1), so the index of (next+1) is (i+2).  So the size
      of remaining array elements is (type->cnt - (i + 2)).
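
      The index arithmetic above can be checked with a minimal standalone
      sketch of merging adjacent regions in an array. The types and the
      function name are simplified stand-ins rather than the memblock code
      itself, and the fixed capacity of 8 is an arbitrary choice for the
      example.

```c
#include <assert.h>
#include <string.h>

#define MAX_REGIONS 8   /* arbitrary capacity for this sketch */

struct region {
    unsigned long base;
    unsigned long size;
};

struct region_array {
    unsigned long cnt;                    /* number of valid entries */
    struct region regions[MAX_REGIONS];
};

/* Merge neighbouring regions that touch, compacting the array in place. */
static void merge_regions(struct region_array *type)
{
    unsigned long i = 0;

    assert(type->cnt <= MAX_REGIONS);
    while (i + 1 < type->cnt) {
        struct region *cur = &type->regions[i];
        struct region *next = &type->regions[i + 1];

        if (cur->base + cur->size != next->base) {
            i++;
            continue;
        }
        cur->size += next->size;
        /*
         * next sits at index i + 1, so next + 1 is index i + 2 and
         * exactly cnt - (i + 2) elements follow it. Copying one more
         * would read past the last valid entry.
         */
        memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next));
        type->cnt--;
    }
}
```

      When the array is full, a size of (cnt - (i + 1)) elements would make
      memmove read one struct region past the end of the array, which is
      the read overflow discussed in this commit.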
      
      The remaining elements of the memblock array are moved forward by
      one element, and this bug causes only one additional element to be moved.
      So there won't be any write overflow here, only a read overflow.  It may
      read one more element beyond the end of the array if the array happens to
      be full.  Commonly it doesn't matter at all, but if the array happens to
      be located at the end of a memblock, it may cause an invalid read
      of a physical address that doesn't exist.
      
      There are 2 *happens to be* here, so I think the probability is quite
      low; I don't know whether anyone has been bitten by this bug before.
      
      Mostly I think it's user-invisible.
      Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0232ae8
    • M
      mm: migrate: check page_count of THP before migrating · 04fa5d6a
      Mel Gorman committed
      Hugh Dickins pointed out that migrate_misplaced_transhuge_page() does
      not check page_count before migrating like base page migration and
      khugepaged.  He could not see why this was safe and he is right.
      
      The potential impact of the bug is avoided due to the limitations of
      NUMA balancing.  The page_mapcount() check ensures that only a single
      address space is using this page and as THPs are typically private it
      should not be possible for another address space to fault it in
      parallel.  If the address space has one associated task then it's
      difficult to have both a GUP pin and be referencing the page at the same
      time.  If there are multiple tasks then a buggy scenario requires that
      another thread be accessing the page while the direct IO is in flight.
      This is dodgy behaviour as there is a possibility of corruption with or
      without THP migration.  It would be
      
      While we happen to be safe for the most part it is shoddy to depend on
      such "safety" so this patch checks the page count similar to anonymous
      pages.  Note that this does not mean that the page_mapcount() check can
      go away.  If we were to remove the page_mapcount() check then the THP
      would have to be unmapped from all referencing PTEs, replaced with
      migration PTEs and restored properly afterwards.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04fa5d6a
    • M
      mm: compaction: Partially revert capture of suitable high-order page · 47ecfcb7
      Mel Gorman committed
      Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
      waiting for POLLIN on a local TCP socket.  It was easier to trigger if
      there was disk IO and dirty pages at the same time and he bisected it to
      commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available").
      
      The intention of that patch was to improve high-order allocations under
      memory pressure after changes made to reclaim in 3.6 drastically hurt
      THP allocations but the approach was flawed.  For Eric, the problem was
      that page->pfmemalloc was not being cleared for captured pages leading
      to a poor interaction with swap-over-NFS support causing the packets to
      be dropped.  However, I identified a few more problems with the patch
      including the fact that it can increase contention on zone->lock in some
      cases which could result in async direct compaction being aborted early.
      
      In retrospect the capture patch took the wrong approach.  What it should
      have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
      was allocating for THP and avoided races that way.  While the patch was
      shown to improve allocation success rates at the time, the benefit is
      marginal given the relative complexity and it should be revisited from
      scratch in the context of the other reclaim-related changes that have
      taken place since the patch was first written and tested.  This patch
      partially reverts commit 1fb3f8ca "mm: compaction: capture a suitable
      high-order page immediately when it is made available".
      Reported-and-tested-by: Eric Wong <normalperson@yhbt.net>
      Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47ecfcb7
  12. 10 Jan, 2013 1 commit
  13. 05 Jan, 2013 2 commits
    • M
      mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT · 53a59fc6
      Michal Hocko committed
      Since commit e303297e ("mm: extended batches for generic
      mmu_gather") we are batching pages to be freed until either
      tlb_next_batch cannot allocate a new batch or we are done.
      
      This works just fine most of the time, but we can get into trouble with
      non-preemptible kernels (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY)
      on large machines, where overly aggressive batching might lead to soft
      lockups during the process exit path (exit_mmap) because there are no
      scheduling points down the free_pages_and_swap_cache path and so the
      freeing can take long enough to trigger the soft lockup.
      
      The lockup is harmless except when the system is set up to panic on
      softlockup, which is not that unusual.
      
      The simplest way to work around this issue is to limit the maximum
      number of batches in a single mmu_gather.  10k of collected pages should
      be safe to prevent soft lockups (we would have 2ms for one) even if
      they are all freed without an explicit scheduling point.
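
      The idea of capping the number of batches can be sketched in
      userspace C as follows. This is an illustration, not the mmu_gather
      code: the structure layout, the function names and the batch size of
      500 pages are assumptions chosen so that the cap works out to roughly
      10k queued pages on top of the one embedded batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define PAGES_PER_BATCH 500U                       /* hypothetical size */
#define MAX_BATCH_COUNT (10000U / PAGES_PER_BATCH) /* cap on extra batches */

struct gather_batch {
    struct gather_batch *next;
    unsigned int nr;             /* pages queued in this batch */
};

struct gather {
    struct gather_batch first;   /* embedded batch, always available */
    struct gather_batch *active;
    unsigned int batch_count;    /* extra batches allocated so far */
};

static void gather_init(struct gather *g)
{
    g->first.next = NULL;
    g->first.nr = 0;
    g->active = &g->first;
    g->batch_count = 0;
}

/* Try to chain another batch; refuse once the cap has been reached. */
static bool gather_next_batch(struct gather *g)
{
    struct gather_batch *b;

    if (g->batch_count == MAX_BATCH_COUNT)
        return false;
    b = calloc(1, sizeof(*b));
    if (b == NULL)
        return false;
    g->batch_count++;
    g->active->next = b;
    g->active = b;
    return true;
}

/* Queue one page. Returns true when the caller must flush before more. */
static bool gather_queue_page(struct gather *g)
{
    assert(g->active->nr < PAGES_PER_BATCH);
    g->active->nr++;
    if (g->active->nr == PAGES_PER_BATCH)
        return !gather_next_batch(g);
    return false;
}

/* Release everything queued so far; returns the number of pages flushed. */
static unsigned int gather_flush(struct gather *g)
{
    unsigned int flushed = g->first.nr;
    struct gather_batch *b = g->first.next;

    while (b != NULL) {
        struct gather_batch *next = b->next;

        flushed += b->nr;
        free(b);
        b = next;
    }
    gather_init(g);
    return flushed;
}
```

      Because gather_queue_page() starts asking for a flush after a bounded
      number of pages, the time spent freeing between two returns to the
      caller is bounded as well, no matter how large the address space is.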
      
      This patch doesn't add any new explicit scheduling points because it
      relies on zap_pmd_range during page tables zapping which calls
      cond_resched per PMD.
      
      The following lockup has been reported for a 3.0 kernel with a huge
      process (on the order of hundreds of gigabytes, but I do not know any more details).
      
        BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
        Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
        Supported: Yes
        CPU 56
        Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
        RIP: 0010:  _raw_spin_unlock_irqrestore+0x8/0x10
        RSP: 0018:ffff883ec1037af0  EFLAGS: 00000206
        RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
        RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
        RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
        R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
        R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
        FS:  00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
        Call Trace:
          release_pages+0xc5/0x260
          free_pages_and_swap_cache+0x9d/0xc0
          tlb_flush_mmu+0x5c/0x80
          tlb_finish_mmu+0xe/0x50
          exit_mmap+0xbd/0x120
          mmput+0x49/0x120
          exit_mm+0x122/0x160
          do_exit+0x17a/0x430
          do_group_exit+0x3d/0xb0
          get_signal_to_deliver+0x247/0x480
          do_signal+0x71/0x1b0
          do_notify_resume+0x98/0xb0
          int_signal+0x12/0x17
        DWARF2 unwinder stuck at int_signal+0x12/0x17
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[3.0+]
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53a59fc6
    • B
      mm: fix zone_watermark_ok_safe() accounting of isolated pages · a458431e
      Bartlomiej Zolnierkiewicz committed
      Commit 702d1a6e ("memory-hotplug: fix kswapd looping forever
      problem") added an isolated pageblocks counter (nr_pageblock_isolate in
      struct zone) and used it to adjust free pages counter in
      zone_watermark_ok_safe() to prevent kswapd looping forever problem.
      
      Then later, commit 2139cbe6 ("cma: fix counting of isolated pages")
      fixed accounting of isolated pages in the global free pages counter.  It
      made the previous zone_watermark_ok_safe() fix unnecessary and
      potentially harmful (because isolated pages may now be accounted twice,
      making the free pages counter incorrect).
      
      This patch removes the special isolated pageblocks counter altogether
      which fixes zone_watermark_ok_safe() free pages check.
      Reported-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a458431e
  14. 04 Jan, 2013 1 commit
    • G
      MM: vmscan: remove __devinit attribute. · fcb35a9b
      Greg Kroah-Hartman committed
      CONFIG_HOTPLUG is going away as an option.  As a result, the __dev*
      markings need to be removed.
      
      This change removes the use of __devinit from the file.
      
      Based on patches originally written by Bill Pemberton, but redone by me
      in order to handle some of the coding style issues better, by hand.
      
      Cc: Bill Pemberton <wfp5p@virginia.edu>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fcb35a9b