1. 04 Jul, 2013 40 commits
    •
      swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES · dcf6b7dd
      Rafael Aquini committed
      Considering the use cases where the swap device supports discard:
      a) and can do it quickly;
      b) but it's slow to do in small granularities (or concurrent with other
         I/O);
      c) but the implementation is so horrendous that you don't even want to
         send one down;
      
      And assuming that the sysadmin considers it useful to send the discards down
      at all, we would (probably) want the following solutions:
      
        i. do the fine-grained discards for freed swap pages, if device is
           capable of doing so optimally;
       ii. do single-time (batched) swap area discards, either at swapon
           or via something like fstrim (not implemented yet);
      iii. allow doing both single-time and fine-grained discards; or
       iv. turn it off completely (default behavior)
      
      As implemented today, one can only enable/disable discards for swap, but
      one cannot select, for instance, solution (ii) on a swap device like (b)
      even though the single-time discard is regarded to be interesting, or
      necessary to the workload because it would imply (i), and the device is
      not capable of performing it optimally.
      
      This patch addresses the scenario depicted above by introducing a way to
      ensure the (probably) wanted solutions (i, ii, iii and iv) can be flexibly
      flagged through swapon(8) to allow a sysadmin to select the most suitable
      swap discard policy according to system constraints.
      
      This patch introduces the new flags SWAP_FLAG_DISCARD_PAGES and
      SWAP_FLAG_DISCARD_ONCE to allow more flexible swap discard policies to be
      selected through swapon(8).  The default behavior is to keep both single-time, or
      batched, area discards (SWAP_FLAG_DISCARD_ONCE) and fine-grained discards
      for page-clusters (SWAP_FLAG_DISCARD_PAGES) enabled, in order to keep
      consistency with older kernel behavior, as well as maintain compatibility
      with older swapon(8).  However, through the newly introduced flags the most
      suitable discard policy can be selected according to any given swap
      device constraint.
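
      As an illustration only (not part of the patch), the four policies
      (i)-(iv) could be mapped to swapon(2) flag words roughly as follows.
      The helper discard_flags() is hypothetical; the flag values are
      copied from the kernel's include/linux/swap.h and defined locally so
      the sketch builds on its own.

```c
#include <stdbool.h>

/* Values as found in the kernel's include/linux/swap.h; redefined here
 * so the sketch does not depend on kernel or libc headers. */
#define SWAP_FLAG_DISCARD       0x10000 /* enable discard for swap */
#define SWAP_FLAG_DISCARD_ONCE  0x20000 /* discard the swap area at swapon time */
#define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */

/* Hypothetical helper: build the swapon(2) flag word for a policy.
 * once  -> single-time (batched) area discard wanted
 * pages -> fine-grained discard of freed page-clusters wanted */
int discard_flags(bool once, bool pages)
{
    if (!once && !pages)
        return 0;                       /* (iv) discards turned off */

    int flags = SWAP_FLAG_DISCARD;
    if (once && pages)
        return flags;                   /* (iii) plain DISCARD keeps both, per the text above */
    if (once)
        flags |= SWAP_FLAG_DISCARD_ONCE;    /* (ii) */
    else
        flags |= SWAP_FLAG_DISCARD_PAGES;   /* (i) */
    return flags;
}
```

      The result would be OR-ed into the flags argument of swapon(2);
      returning plain SWAP_FLAG_DISCARD for "both" follows the
      compatibility behavior described above.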
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Karel Zak <kzak@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: tune vm_committed_as percpu_counter batching size · 917d9290
      Tim Chen committed
      Currently the per cpu counter's batch size for memory accounting is
      configured as twice the number of cpus in the system.  However, for
      systems with very large memory, it is more appropriate to make it
      proportional to the memory size per cpu in the system.
      
      For example, for an x86_64 system with 64 cpus and 128 GB of memory, the
      batch size is only 2*64 pages (0.5 MB).  So any memory accounting
      changes of more than 0.5MB will overflow the per cpu counter into the
      global counter.  Instead, for the new scheme, the batch size is
      configured to be 0.4% of the memory/cpu = 8MB (128 GB/64 /256), which is
      more in line with the memory size.
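
      The arithmetic in the example can be checked with a small sketch.
      It assumes 4 KB pages; old_batch_pages() and new_batch_pages() are
      illustrative names, not the kernel's functions, and any clamping the
      real code applies is left out.

```c
#include <stdint.h>

/* Old scheme described above: the batch is twice the CPU count, in pages. */
int64_t old_batch_pages(int ncpus)
{
    return 2 * (int64_t)ncpus;
}

/* New scheme described above: 1/256 (about 0.4%) of the memory per CPU,
 * expressed in pages. */
int64_t new_batch_pages(int64_t total_pages, int ncpus)
{
    return total_pages / ncpus / 256;
}
```

      With 128 GB of 4 KB pages (33554432 pages) and 64 cpus, the old
      batch is 128 pages (0.5 MB) and the new batch is 2048 pages (8 MB).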
      
      I've done a repeated brk test of 800KB (from will-it-scale test suite)
      with 80 concurrent processes on a 4 socket Westmere machine with a total
      of 40 cores.  Without the patch, about 80% of cpu is spent on spin-lock
      contention within the vm_committed_as counter.  With the patch, there's
      a 73x speedup on the benchmark and the lock contention drops off almost
      entirely.
      
      [akpm@linux-foundation.org: fix section mismatch]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/hugetlb: use already existing interface huge_page_shift · 2415cf12
      Wanpeng Li committed
      Use the already existing interface huge_page_shift instead of h->order +
      PAGE_SHIFT.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/hugetlb: remove hugetlb_prefault · 5f1e31d2
      Wanpeng Li committed
      hugetlb_prefault() is no longer used; this patch removes it.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/pageblock: remove get/set_pageblock_flags · 4c42efa2
      Wanpeng Li committed
      get_pageblock_flags and set_pageblock_flags are no longer used; this
      patch removes them.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/memory-hotplug: fix lowmem count overflow when offline pages · cea27eb2
      Wanpeng Li committed
      The logic for the memory-remove code fails to correctly account the
      Total High Memory when a memory block which contains High Memory is
      offlined as shown in the example below.  The following patch fixes it.
      
      Before logical memory removal:
      
      MemTotal:        7603740 kB
      MemFree:         6329612 kB
      Buffers:           94352 kB
      Cached:           872008 kB
      SwapCached:            0 kB
      Active:           626932 kB
      Inactive:         519216 kB
      Active(anon):     180776 kB
      Inactive(anon):   222944 kB
      Active(file):     446156 kB
      Inactive(file):   296272 kB
      Unevictable:           0 kB
      Mlocked:               0 kB
      HighTotal:       7294672 kB
      HighFree:        5704696 kB
      LowTotal:         309068 kB
      LowFree:          624916 kB
      
      After logical memory removal:
      
      MemTotal:        7079452 kB
      MemFree:         5805976 kB
      Buffers:           94372 kB
      Cached:           872000 kB
      SwapCached:            0 kB
      Active:           626936 kB
      Inactive:         519236 kB
      Active(anon):     180780 kB
      Inactive(anon):   222944 kB
      Active(file):     446156 kB
      Inactive(file):   296292 kB
      Unevictable:           0 kB
      Mlocked:               0 kB
      HighTotal:       7294672 kB
      HighFree:        5181024 kB
      LowTotal:       4294752076 kB
      LowFree:          624952 kB
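
      The LowTotal value above is consistent with an unsigned 32-bit
      subtraction wrapping around once MemTotal drops below the stale
      HighTotal.  The sketch below reproduces the figure under that
      assumption, with 4 kB pages and a 32-bit unsigned long modelled as
      uint32_t; low_total_kb() is an illustrative name, not kernel code.

```c
#include <stdint.h>

/* Model of "LowTotal = total RAM - total high memory" as computed with
 * 32-bit page counters and reported in kB (4 kB pages assumed). */
uint32_t low_total_kb(uint32_t mem_total_kb, uint32_t high_total_kb)
{
    uint32_t total_pages = mem_total_kb / 4;
    uint32_t high_pages = high_total_kb / 4;
    uint32_t low_pages = total_pages - high_pages; /* wraps if high > total */

    return (uint32_t)(low_pages * 4u);             /* pages back to kB, mod 2^32 */
}
```

      Feeding in the MemTotal and HighTotal values from the two listings
      gives 309068 kB before the removal and 4294752076 kB after it.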
      
      [mhocko@suse.cz: fix CONFIG_HIGHMEM=n build]
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org> [2.6.24+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/memory_hotplug.c: change normal message to use pr_debug · 4996eed8
      Toshi Kani committed
      During early boot-up, iomem_resource is set up from the boot descriptor
      table, such as EFI Memory Table and e820.  Later,
      acpi_memory_device_add() calls add_memory() for each ACPI memory device
      object as it enumerates ACPI namespace.  This add_memory() call is
      expected to fail in register_memory_resource() at boot since
      iomem_resource has been set up from EFI/e820.  As a result, add_memory()
      returns -EEXIST, which acpi_memory_device_add() handles as the normal
      case.
      
      This scheme works fine, but the following error message is logged for
      every ACPI memory device object during boot-up.
      
        "System RAM resource %pR cannot be added\n"
      
      This patch changes register_memory_resource() to use pr_debug() for the
      message as it shows up under the normal case.
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/memory-failure.c: fix memory leak in successful soft offlining · f15bdfa8
      Naoya Horiguchi committed
      After a successful page migration by soft offlining, the source page is
      not properly freed and it's never reusable even if we unpoison it
      afterward.
      
      This is caused by the race between freeing page and setting PG_hwpoison.
      In successful soft offlining, the source page is put (and the refcount
      becomes 0) by putback_lru_page() in unmap_and_move(), where it's linked
      to pagevec and actual freeing back to buddy is delayed.  So if
      PG_hwpoison is set for the page before freeing, the freeing does not
      function as expected (in such a case freeing aborts in the
      free_pages_prepare() check).
      
      This patch tries to make sure to free the source page before setting
      PG_hwpoison on it.  To avoid reallocating, the page keeps
      MIGRATE_ISOLATE until after setting PG_hwpoison.
      
      This patch also removes obsolete comments about "keeping elevated
      refcount" because what they say is not true.  Unlike memory_failure(),
      soft_offline_page() uses no special page isolation code, and the
      soft-offlined pages have no elevated refcount.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/nommu.c: add additional check for vread() just like vwrite() has done · 9bde916b
      Chen Gang committed
      vwrite() checks for overflow. vread() should do the same thing.
      
      Since vwrite() checks the source buffer address, vread() should check
      the destination buffer address.
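
      A minimal sketch of this kind of overflow check, assuming the goal
      is to clamp the byte count so that buffer address + count cannot
      wrap past the top of the address space.  clamp_count() is a
      hypothetical stand-alone helper, not the code in mm/nommu.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp count so that [addr, addr + count) does not wrap around. */
size_t clamp_count(uintptr_t addr, size_t count)
{
    uintptr_t end = addr + (uintptr_t)count;    /* may wrap (unsigned) */

    if (end < addr)                             /* wrapped: range is too long */
        count = (size_t)(UINTPTR_MAX - addr) + 1; /* bytes left up to the top */
    return count;
}
```

      A buffer starting 10 bytes below the top of the address space with a
      requested count of 100 is clamped to 10; ordinary ranges are
      returned unchanged.
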
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/page_alloc.c: add additional checking and return value for the 'table->data' · dacbde09
      Chen Gang committed
      - check the length of the procfs data before copying it into a fixed
        size array.
      
      - when __parse_numa_zonelist_order() fails, save the error code for
        return.
      
      - 'char*' --> 'char *' coding style fix
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru · c53954a0
      Mel Gorman committed
      Similar to __pagevec_lru_add, this patch removes the LRU parameter from
      __lru_cache_add and lru_cache_add_lru as the caller does not control the
      exact LRU the page gets added to.  lru_cache_add_lru gets renamed to
      lru_cache_add as the name is silly without the lru parameter.  With the
      parameter removed, it is required that the caller indicate if they want
      the page added to the active or inactive list by setting or clearing
      PageActive respectively.
      
      [akpm@linux-foundation.org: Suggested the patch]
      [gang.chen@asianux.com: fix used-uninitialized warning]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API · a0b8cab3
      Mel Gorman committed
      Now that the LRU to add a page to is decided at LRU-add time, remove the
      misleading lru parameter from __pagevec_lru_add.  A consequence of this
      is that the pagevec_lru_add_file, pagevec_lru_add_anon and similar
      helpers are misleading as the caller no longer has direct control over
      what LRU the page is added to.  Unused helpers are removed by this patch
      and existing users of pagevec_lru_add_file() are converted to use
      lru_cache_add_file() directly and use the per-cpu pagevecs instead of
      creating their own pagevec.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: activate !PageLRU pages on mark_page_accessed if page is on local pagevec · 059285a2
      Mel Gorman committed
      If a page is on a pagevec then it is !PageLRU and mark_page_accessed()
      may fail to move a page to the active list as expected.  Now that the
      LRU is selected at LRU drain time, mark pages PageActive if they are on
      the local pagevec so it gets moved to the correct list at LRU drain
      time.  Using a debugging patch, it was found that for a simple git
      checkout based workload, pages were never added to the active file
      list in practice, but with this patch applied they are.
      
                                      before       after
      LRU Add Active File                  0      750583
      LRU Add Active Anon            2640587     2702818
      LRU Add Inactive File          8833662     8068353
      LRU Add Inactive Anon              207         200
      
      Note that only pages on the local pagevec are considered on purpose.  A
      !PageLRU page could be in the process of being released, reclaimed,
      migrated or on a remote pagevec that is currently being drained.
      Marking it PageActive is vulnerable to races where PageLRU and Active
      bits are checked at the wrong time.  Page reclaim will trigger
      VM_BUG_ONs but depending on when the race hits, it could also free a
      PageActive page to the page allocator and trigger a bad_page warning.
      Similarly a potential race exists between a per-cpu drain on a pagevec
      list and an activation on a remote CPU.
      
      				lru_add_drain_cpu
      				__pagevec_lru_add
      				  lru = page_lru(page);
      mark_page_accessed
        if (PageLRU(page))
          activate_page
        else
          SetPageActive
      				  SetPageLRU(page);
      				  add_page_to_lru_list(page, lruvec, lru);
      
      In this case a PageActive page is added to the inactive list and later
      the inactive/active stats will get skewed.  While the PageActive checks
      in vmscan could be removed and potentially dealt with, a skew in the
      statistics would be very difficult to detect.  Hence this patch deals
      just with the common case where a page being marked accessed has just
      been added to the local pagevec.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: pagevec: defer deciding which LRU to add a page to until pagevec drain time · 13f7f789
      Mel Gorman committed
      mark_page_accessed() cannot activate an inactive page that is located on
      an inactive LRU pagevec.  Hints from filesystems may be ignored as a
      result.  In preparation for fixing that problem, this patch removes the
      per-LRU pagevecs and leaves just one pagevec.  The final LRU the page is
      added to is deferred until the pagevec is drained.
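
      The idea can be pictured with a toy user-space model: pages are
      queued on a single vector without a target list, and the list is
      derived from the page flags only when the vector is drained.  All
      types and names below are simplified stand-ins for illustration, not
      the kernel's struct page or pagevec API, and the vector size is
      arbitrary.

```c
#include <stdbool.h>
#include <stddef.h>

enum lru { INACTIVE_ANON, ACTIVE_ANON, INACTIVE_FILE, ACTIVE_FILE, NR_LRU };

/* Toy page: only the two properties that select an LRU list. */
struct page { bool file; bool active; };

#define PAGEVEC_SIZE 14  /* illustrative capacity */
struct pagevec { size_t nr; struct page *pages[PAGEVEC_SIZE]; };

/* Queue a page; no list is chosen yet.  Returns false when full. */
bool pagevec_add(struct pagevec *pv, struct page *p)
{
    if (pv->nr == PAGEVEC_SIZE)
        return false;
    pv->pages[pv->nr++] = p;
    return true;
}

/* Drain: pick each page's list from its flags as they are now and
 * count the insertion in counts[]. */
void pagevec_drain(struct pagevec *pv, size_t counts[NR_LRU])
{
    for (size_t i = 0; i < pv->nr; i++) {
        const struct page *p = pv->pages[i];
        enum lru l = p->file ? (p->active ? ACTIVE_FILE : INACTIVE_FILE)
                             : (p->active ? ACTIVE_ANON : INACTIVE_ANON);
        counts[l]++;
    }
    pv->nr = 0;
}
```

      In this model a page whose active flag is set after it was queued
      but before the drain is counted on the active list, which is the
      property the follow-up fix to mark_page_accessed() depends on.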
      
      This means that fewer pagevecs are available and potentially there is
      greater contention on the LRU lock.  However, this only applies in the
      case where there is an almost perfect mix of file, anon, active and
      inactive pages being added to the LRU.  In practice I expect that we are
      adding a stream of pages of a particular type and that the changes in
      contention will barely be measurable.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: add tracepoints for LRU activation and insertions · c6286c98
      Mel Gorman committed
      Andrew Perepechko reported a problem whereby pages are being prematurely
      evicted as the mark_page_accessed() hint is ignored for pages that are
      currently on a pagevec --
      http://www.spinics.net/lists/linux-ext4/msg37340.html .
      
      Alexey Lyahkov and Robin Dong have also reported problems recently that
      could be due to hot pages reaching the end of the inactive list too
      quickly and being reclaimed.
      
      Rather than addressing this on a per-filesystem basis, this series aims
      to fix the mark_page_accessed() interface by deferring the decision on
      what LRU a page is added to until pagevec drain time and allowing
      mark_page_accessed() to call SetPageActive on a pagevec page.
      
      Patch 1 adds two tracepoints for LRU page activation and insertion. Using
      	these tracepoints it's possible to build a model of pages in the
      	LRU that can be processed offline.
      
      Patch 2 defers making the decision on what LRU to add a page to until when
      	the pagevec is drained.
      
      Patch 3 searches the local pagevec for pages to mark PageActive on
      	mark_page_accessed. The changelog explains why only the local
      	pagevec is examined.
      
      Patches 4 and 5 tidy up the API.
      
      postmark, a dd-based test and fs-mark in both single and threaded mode
      were run, but none of them showed any performance degradation or gain as
      a result of the patch.
      
      Using patch 1, I built a *very* basic model of the LRU to examine
      offline what the average age of different page types on the LRU was, in
      milliseconds.  Of course, capturing the trace distorts the test as it's
      written to local disk but it does not matter for the purposes of this
      test.  The average age of pages in milliseconds was:
      
      				    vanilla deferdrain
      Average age mapped anon:               1454       1250
      Average age mapped file:             127841     155552
      Average age unmapped anon:               85        235
      Average age unmapped file:            73633      38884
      Average age unmapped buffers:         74054     116155
      
      The LRU activity was mostly files which you'd expect for a dd-based
      workload.  Note that the average age of buffer pages is increased by the
      series and it is expected this is due to the fact that the buffer pages
      are now getting added to the active list when drained from the pagevecs.
      Note that the average age of the unmapped file data is decreased as they
      are still added to the inactive list and are reclaimed before the
      buffers.
      
      There is no guarantee this is a universal win for all workloads and it
      would be nice if the filesystem people gave some thought as to whether
      this decision is generally a win or a loss.
      
      This patch:
      
      Using these tracepoints it is possible to model LRU activity and the
      average residency of pages of different types.  This can be used to
      debug problems related to premature reclaim of pages of particular
      types.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      memcg: update TODO list in Documentation · f968ef1c
      Li Zefan committed
      hugetlb cgroup has already been implemented.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Rob Landley <rob@landley.net>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmcore: support mmap() on /proc/vmcore · 83086978
      HATAYAMA Daisuke committed
      This patch introduces mmap_vmcore().
      
      Don't permit writable or executable mappings, even with mprotect(),
      because this mmap() is aimed at reading crash dump memory.  A non-writable
      mapping is also a requirement of remap_pfn_range() when mapping linear
      pages on non-consecutive physical pages; see is_cow_mapping().

      Set the VM_MIXEDMAP flag to remap memory by remap_pfn_range and by
      remap_vmalloc_range_partial at the same time for a single vma.
      do_munmap() can correctly clean up a partially remapped vma with the two
      functions in the abnormal case.  See zap_pte_range(), vm_normal_page() and
      their comments for details.
      
      On x86-32 PAE kernels, mmap() supports at most 16TB memory only.  This
      limitation comes from the fact that the third argument of
      remap_pfn_range(), pfn, is of 32-bit length on x86-32: unsigned long.
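
      The 16TB figure follows from the pfn width and the page size; a
      quick check, assuming 4 KB pages (page shift of 12).
      max_mappable_bytes() is an illustrative helper, not a kernel
      function.

```c
#include <stdint.h>

/* Largest byte range addressable by a page frame number of pfn_bits
 * bits with pages of 2^page_shift bytes.
 * Requires pfn_bits + page_shift < 64. */
uint64_t max_mappable_bytes(unsigned pfn_bits, unsigned page_shift)
{
    return UINT64_C(1) << (pfn_bits + page_shift);
}
```

      A 32-bit pfn with a page shift of 12 gives 2^44 bytes, i.e. 16TB.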
      
      [akpm@linux-foundation.org: use min(), switch to conventional error-unwinding approach]
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Tested-by: Maxim Uvarov <muvarov@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmcore: calculate vmcore file size from buffer size and total size of vmcore objects · 591ff716
      HATAYAMA Daisuke committed
      The previous patches newly added holes before each chunk of memory and
      the holes need to be counted in the vmcore file size.  There are two ways
      to count the file size in such a way:

      1) suppose m is a pointer to the last vmcore object in vmcore_list.
         Then the file size is (m->offset + m->size), or
      
      2) calculate sum of size of buffers for ELF header, program headers,
         ELF note segments and objects in vmcore_list.
      
      Although 1) is more direct and simpler than 2), 2) seems better in that
      it reflects internal object structure of /proc/vmcore.  Thus, this patch
      changes get_vmcore_size_elf{64, 32} so that it calculates size in the
      way of 2).
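
      Approach 2) amounts to a plain sum.  The sketch below uses an array
      in place of the kernel's linked list; struct vmcore_obj, the
      parameter names and vmcore_size() are simplified stand-ins rather
      than the actual get_vmcore_size() code.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an entry of vmcore_list. */
struct vmcore_obj { uint64_t size; };

/* File size = ELF header and program header buffer
 *           + ELF note segment buffer
 *           + every object in the list. */
uint64_t vmcore_size(uint64_t elf_hdrs_sz, uint64_t elfnotes_sz,
                     const struct vmcore_obj *objs, size_t n)
{
    uint64_t size = elf_hdrs_sz + elfnotes_sz;

    for (size_t i = 0; i < n; i++)
        size += objs[i].size;
    return size;
}
```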
      
      As a result, both get_vmcore_size_elf{64, 32} have the same definition.
      Merge them as get_vmcore_size.
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmcore: allow user process to remap ELF note segment buffer · ef9e78fd
      HATAYAMA Daisuke committed
      Now ELF note segment has been copied in the buffer on vmalloc memory.
      To allow user process to remap the ELF note segment buffer with
      remap_vmalloc_page, the corresponding VM area object has to have
      VM_USERMAP flag set.
      
      [akpm@linux-foundation.org: use the conventional comment layout]
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmcore: allocate ELF note segment in the 2nd kernel vmalloc memory · 087350c9
      HATAYAMA Daisuke committed
      The reasons why we don't allocate the ELF note segment in the 1st kernel
      (old memory) on a page boundary are to keep backward compatibility for old
      kernels, and that doing so would waste a considerable amount of memory due
      to the round-up operation to fit the memory to a page boundary, since most
      of the buffers are in the per-cpu area.
      
      ELF notes are per-cpu, so the total size of the ELF note segments depends
      on the number of CPUs.  The current maximum number of CPUs on x86_64 is
      5192, and there's already a system with 4192 CPUs at SGI, where the total
      size amounts to 1MB.  This can be larger in the near future or possibly
      even now on another architecture that has a larger note size per cpu.
      Thus, to avoid the case where memory allocation for a large block fails,
      we allocate vmcore objects on vmalloc memory.
      
      This patch adds elfnotes_buf and elfnotes_sz variables to keep pointer
      to the ELF note segment buffer and its size.  There's no longer the
      vmcore object that corresponds to the ELF note segment in vmcore_list.
      Accordingly, read_vmcore() has a new case for the ELF note segment, and
      set_vmcore_list_offsets_elf{64,32}() and other helper functions start
      calculating the offset from the sum of the size of the ELF headers and the
      size of the ELF note segment.
      
      [akpm@linux-foundation.org: use min(), fix error-path vzalloc() leaks]
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmalloc: introduce remap_vmalloc_range_partial · e69e9d4a
      HATAYAMA Daisuke committed
      We want to allocate ELF note segment buffer on the 2nd kernel in vmalloc
      space and remap it to user-space in order to reduce the risk that memory
      allocation fails on system with huge number of CPUs and so with huge ELF
      note segment that exceeds 11-order block size.
      
      Although remap_vmalloc_range already exists for the purpose of
      remapping vmalloc memory to user-space, it requires the user-space
      range to be specified via a vma.  mmap on /proc/vmcore needs to remap a
      range across multiple objects, so an interface that requires the vma to
      cover the full range is problematic.
      
      This patch introduces remap_vmalloc_range_partial that receives user-space
      range as a pair of base address and size and can be used for mmap on
      /proc/vmcore case.
      
      remap_vmalloc_range is rewritten using remap_vmalloc_range_partial.
      
      [akpm@linux-foundation.org: use PAGE_ALIGNED()]
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      vmalloc: make find_vm_area check in range · cef2ac3f
      HATAYAMA Daisuke committed
      Currently, __find_vmap_area searches for the kernel VM area starting at
      a given address.  This patch changes this behavior so that it searches
      for the kernel VM area to which the address belongs.  This change is
      needed by remap_vmalloc_range_partial, introduced in a later patch,
      which accepts any position within a kernel VM area as the target address.
      
      This patch changes the condition (addr > va->va_start) to
      (addr >= va->va_end), taking advantage of the fact that kernel VM
      areas do not overlap.
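
      The range check described above can be illustrated with a small
      user-space sketch.  It is not the kernel code: struct area and
      find_area() are hypothetical names, a sorted array stands in for the
      kernel's tree of areas, and each area is assumed to be the half-open
      range [va_start, va_end).

```c
#include <stddef.h>

/* Hypothetical stand-in for a kernel VM area: [va_start, va_end). */
struct area {
    unsigned long va_start;
    unsigned long va_end;
};

/*
 * Return the area that contains addr, or NULL if no area does.
 * areas[] must be sorted by va_start and must not overlap; that is
 * what makes "addr >= va_end" a safe reason to search to the right.
 */
static const struct area *find_area(const struct area *areas, size_t n,
                                    unsigned long addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct area *va = &areas[mid];

        if (addr < va->va_start)
            hi = mid;          /* addr lies before this area */
        else if (addr >= va->va_end)
            lo = mid + 1;      /* addr lies after this area */
        else
            return va;         /* va_start <= addr < va_end */
    }
    return NULL;
}
```

      With the older style of comparison against va_start only, a lookup
      would succeed only for the first address of an area; with the check
      against va_end, any address inside the area finds it.
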
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cef2ac3f
    • H
      vmcore: treat memory chunks referenced by PT_LOAD program header entries in... · 7f614cd1
      HATAYAMA Daisuke committed
      vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list
      
      Treat memory chunks referenced by PT_LOAD program header entries in
      page-size boundary in vmcore_list.  Formally, for each range [start,
      end], we set up the corresponding vmcore object in vmcore_list to
      [rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].
      
      This change affects layout of /proc/vmcore.  The gaps generated by the
      rearrangement are newly made visible to applications as holes.
      Concretely, they are two ranges [rounddown(start, PAGE_SIZE), start] and
      [end, roundup(end, PAGE_SIZE)].
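
      As a rough illustration of the rounding described above, the
      following user-space sketch computes the page-aligned object range
      and the two holes for a given [start, end].  The names
      SKETCH_PAGE_SIZE, struct chunk_layout and layout_chunk() are
      hypothetical, and a 4 KiB page size is assumed.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL  /* assumption: 4 KiB pages */

/* Round x down to the nearest multiple of the page size. */
static uint64_t round_down_page(uint64_t x)
{
    return x - (x % SKETCH_PAGE_SIZE);
}

/* Round x up to the nearest multiple of the page size. */
static uint64_t round_up_page(uint64_t x)
{
    return round_down_page(x + SKETCH_PAGE_SIZE - 1);
}

struct chunk_layout {
    uint64_t obj_start;  /* rounddown(start, PAGE_SIZE) */
    uint64_t obj_end;    /* roundup(end, PAGE_SIZE) */
    uint64_t head_hole;  /* bytes between obj_start and start */
    uint64_t tail_hole;  /* bytes between end and obj_end */
};

/* Compute the page-aligned range covering [start, end] and its holes. */
static struct chunk_layout layout_chunk(uint64_t start, uint64_t end)
{
    struct chunk_layout l;

    l.obj_start = round_down_page(start);
    l.obj_end = round_up_page(end);
    l.head_hole = start - l.obj_start;
    l.tail_hole = l.obj_end - end;
    return l;
}
```

      For example, a chunk spanning 0x1234 to 0x5678 would be covered by an
      object spanning 0x1000 to 0x6000, with a 0x234-byte hole in front and
      a 0x988-byte hole behind.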
      
      Suppose variable m points at a vmcore object in vmcore_list, and
      variable phdr points at the program header of PT_LOAD type the variable
      m corresponds to.  Then, pictorially:
      
        m->offset                    +---------------+
                                     | hole          |
      phdr->p_offset =               +---------------+
        m->offset + (paddr - start)  |               |\
                                     | kernel memory | phdr->p_memsz
                                     |               |/
                                     +---------------+
                                     | hole          |
        m->offset + m->size          +---------------+
      
      where m->offset and m->offset + m->size are always page-size aligned.
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f614cd1
    • H
      vmcore: allocate buffer for ELF headers on page-size alignment · f2bdacdd
      HATAYAMA Daisuke committed
      Allocate ELF headers on page-size boundary using __get_free_pages()
      instead of kmalloc().
      
      A later patch will merge PT_NOTE entries into a single unique one and
      decrease the buffer size actually used.  Keep the original buffer size
      in variable elfcorebuf_sz_orig so that the buffer can be freed later,
      and keep the size actually used, rounded up to the page-size boundary,
      separately in variable elfcorebuf_sz.
      
      The size of the part of the ELF buffer exported through /proc/vmcore is
      elfcorebuf_sz.
      
      The space left by the merged, removed PT_NOTE entries, i.e. the range
      [elfcorebuf_sz, elfcorebuf_sz_orig], is filled with 0.
      
      Use size of the ELF headers as an initial offset value in
      set_vmcore_list_offsets_elf{64,32} and
      process_ptload_program_headers_elf{64,32} in order to indicate that the
      offset includes the holes towards the page boundary.
      
      As a result, both set_vmcore_list_offsets_elf{64,32} have the same
      definition.  Merge them as set_vmcore_list_offsets.
      
      [akpm@linux-foundation.org: add free_elfcorebuf(), cleanups]
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2bdacdd
    • H
      vmcore: clean up read_vmcore() · b27eb186
      HATAYAMA Daisuke committed
      Rewrite the part of read_vmcore() that reads objects in vmcore_list in
      the same way as the part that reads the ELF headers, which removes some
      duplicated and redundant code.
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b27eb186
    • A
      include/linux/mm.h: add PAGE_ALIGNED() helper · 0fa73b86
      Andrew Morton committed
      To test whether an address is aligned to PAGE_SIZE.
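
      A minimal user-space sketch of such an alignment test follows.  It is
      not the kernel macro itself: page_aligned() and SKETCH_PAGE_SIZE are
      hypothetical names, and a 4 KiB page size is assumed.  The mask trick
      relies on the page size being a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL  /* assumption: 4 KiB pages */

/* True if addr is a multiple of the (power-of-two) page size. */
static bool page_aligned(uintptr_t addr)
{
    return (addr & (SKETCH_PAGE_SIZE - 1)) == 0;
}
```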
      
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fa73b86
    • C
      memory_hotplug: use pgdat_resize_lock() in __offline_pages() · d702909f
      Cody P Schafer committed
      mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as
      follows:
      
              * Must be held any time you expect node_start_pfn, node_present_pages
              * or node_spanned_pages stay constant.  [...]
      
      So actually hold it when we update node_present_pages in __offline_pages().
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d702909f
    • C
      memory_hotplug: use pgdat_resize_lock() in online_pages() · aa47228a
      Cody P Schafer committed
      mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as
      follows:
      
              * Must be held any time you expect node_start_pfn, node_present_pages
              * or node_spanned_pages stay constant.  [...]
      
      So actually hold it when we update node_present_pages in online_pages().
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa47228a
    • C
      114d4b79
    • C
    • M
      fs: nfs: inform the VM about pages being committed or unstable · f919b196
      Mel Gorman committed
      VM page reclaim uses dirty and writeback page states to determine if
      flushers are cleaning pages too slowly and that page reclaim should
      stall waiting on flushers to catch up.  Page state in NFS is a bit more
      complex and a clean page can be unreclaimable due to being unstable
      which is effectively "dirty" from the perspective of the VM from reclaim
      context.  Similarly, if the inode is currently being committed then it's
      similar to being under writeback.
      
      This patch adds an is_dirty_writeback() handler for NFS that checks if a
      page's backing inode is being committed and should be accounted as
      writeback and if a page has private state indicating that it is
      effectively dirty.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f919b196
    • M
      mm: vmscan: take page buffers dirty and locked state into account · b4597226
      Mel Gorman committed
      Page reclaim keeps track of dirty and under writeback pages and uses it
      to determine if wait_iff_congested() should stall or if kswapd should
      begin writing back pages.  This fails to account for buffer pages that
      can be under writeback but not PageWriteback which is the case for
      filesystems like ext3 ordered mode.  Furthermore, PageDirty buffer pages
      can have all the buffers clean and writepage does no IO so it should not
      be accounted as congested.
      
      This patch adds an address_space operation that filesystems may
      optionally use to check if a page is really dirty or really under
      writeback.  An implementation for buffer_heads is added and used for
      block operations and ext3 in ordered mode.  By default the page flags
      are obeyed.
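
      The idea of an optional per-filesystem check with a flag-based
      default can be sketched in user space as follows.  Everything here is
      a simplified, hypothetical model: struct sketch_page,
      dirty_writeback_fn, buffer_check() and page_check_dirty_writeback()
      are illustrative names and do not reproduce the kernel's structures
      or the patch's actual code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a page; not the kernel's struct page. */
struct sketch_page {
    bool flag_dirty;      /* models the PageDirty flag */
    bool flag_writeback;  /* models the PageWriteback flag */
    bool buffers_dirty;   /* at least one attached buffer is dirty */
    bool buffers_locked;  /* at least one attached buffer is locked for IO */
};

/* Optional per-filesystem hook that may refine the two answers. */
typedef void (*dirty_writeback_fn)(const struct sketch_page *page,
                                   bool *dirty, bool *writeback);

/* Buffer-based check: consult buffer state instead of page flags alone. */
static void buffer_check(const struct sketch_page *page,
                         bool *dirty, bool *writeback)
{
    *dirty = page->buffers_dirty;
    *writeback = page->flag_writeback || page->buffers_locked;
}

/* Default to the page flags; let the hook override them if one is set. */
static void page_check_dirty_writeback(const struct sketch_page *page,
                                       dirty_writeback_fn op,
                                       bool *dirty, bool *writeback)
{
    *dirty = page->flag_dirty;
    *writeback = page->flag_writeback;

    if (op)
        op(page, dirty, writeback);
}
```

      In this model, a page whose dirty flag is set but whose buffers are
      all clean is reported as dirty when no hook is installed and as clean
      when buffer_check() is used, which mirrors the two cases the commit
      message describes.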
      
      Credit goes to Jan Kara for identifying that the page flags alone are
      not sufficient for ext3 and sanity checking a number of ideas on how the
      problem could be addressed.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4597226
    • M
      mm: vmscan: treat pages marked for immediate reclaim as zone congestion · d04e8acd
      Mel Gorman committed
      Currently a zone will only be marked congested if the underlying BDI is
      congested but if dirty pages are spread across zones it is possible that
      an individual zone is full of dirty pages without being congested.  The
      impact is that the zone gets scanned very quickly, potentially reclaiming
      really clean pages.  This patch treats pages marked for immediate
      reclaim as congested for the purposes of marking a zone ZONE_CONGESTED
      and stalling in wait_iff_congested.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d04e8acd
    • M
      mm: vmscan: move direct reclaim wait_iff_congested into shrink_list · 8e950282
      Mel Gorman committed
      shrink_inactive_list makes decisions on whether to stall based on the
      number of dirty pages encountered.  The wait_iff_congested() call in
      shrink_page_list does no such thing and it's arbitrary.
      
      This patch moves the decision on whether to set ZONE_CONGESTED and the
      wait_iff_congested call into shrink_page_list.  This keeps all the
      decisions on whether to stall or not in the one place.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e950282
    • M
      mm: vmscan: set zone flags before blocking · f7ab8db7
      Mel Gorman committed
      In shrink_page_list a decision may be made to stall and flag a zone as
      ZONE_WRITEBACK so that if a large number of unqueued dirty pages are
      encountered later then the reclaimer will stall.  Set ZONE_WRITEBACK
      before potentially going to sleep so it is noticed sooner.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7ab8db7
    • M
      mm: vmscan: stall page reclaim after a list of pages have been processed · b1a6f21e
      Mel Gorman committed
      Commit "mm: vmscan: Block kswapd if it is encountering pages under
      writeback" blocks page reclaim if it encounters pages under writeback
      marked for immediate reclaim.  It blocks while pages are still isolated
      from the LRU which is unnecessary.  This patch defers the blocking until
      after the isolated pages have been processed and tidies up some of the
      comments.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1a6f21e
    • M
      mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered · e2be15f6
      Mel Gorman committed
      Further testing of the "Reduce system disruption due to kswapd" series
      discovered a few problems.  First and foremost, it's possible for pages
      under writeback to be freed which will lead to badness.  Second, as
      pages were not being swapped the file LRU was being scanned faster and
      clean file pages were being reclaimed.  In some cases this results in
      increased read IO to re-read data from disk.  Third, more pages were
      being written from kswapd context which can adversely affect IO
      performance.  Lastly, it was observed that PageDirty pages are not
      necessarily dirty on all filesystems (buffers can be clean while
      PageDirty is set and ->writepage generates no IO) and not all
      filesystems set PageWriteback when the page is being written (e.g.
      ext3).  This disconnect confuses the reclaim stalling logic.  This
      follow-up series is aimed at these problems.
      
      The tests were based on three kernels:
      
      vanilla:           kernel 3.9, as that is what the current mmotm uses as a baseline
      mmotm-20130522:    mmotm as of 22nd May with "Reduce system disruption due to
                         kswapd" applied on top, as per what should be in Andrew's tree
                         right now
      lessdisrupt-v7r10: this follow-up series on top of the mmotm kernel
      
      The first test used memcached+memcachetest while some background IO was
      in progress, as implemented by the parallel IO tests in MM Tests.
      memcachetest benchmarks how many operations/second memcached can
      service.  It starts with no background IO on a freshly created ext4
      filesystem and then re-runs the test with larger amounts of IO in the
      background to roughly simulate a large copy in progress.  The
      expectation is that the IO should have little or no impact on
      memcachetest which is running entirely in memory.
      
      parallelio
                                                   3.9.0                       3.9.0                       3.9.0
                                                 vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
      Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
      Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
      Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
      Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
      Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
      Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
      Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
      Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
      Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
      Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
      Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
      Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
      Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
      
      memcachetest is the transactions/second reported by memcachetest. In
              the vanilla kernel note that performance drops from around
              23K/sec to just over 4K/second when there is 2385M of IO going
              on in the background. With current mmotm, there is no collapse
              in performance and with this follow-up series there is little
              change.
      
      swaptotal is the total amount of swap traffic. With mmotm and the follow-up
              series, the total amount of swapping is much reduced.
      
                                       3.9.0               3.9.0                  3.9.0
                                     vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
      Minor Faults                  11160152    10706748    10622316
      Major Faults                     46305         755         678
      Swap Ins                        260249           0           0
      Swap Outs                       683860          18          18
      Direct pages scanned                 0         678        2520
      Kswapd pages scanned           6046108     8814900     1639279
      Kswapd pages reclaimed         1081954     1172267     1094635
      Direct pages reclaimed               0         566        2304
      Kswapd efficiency                  17%         13%         66%
      Kswapd velocity               5217.560    7618.953    1414.879
      Direct efficiency                 100%         83%         91%
      Direct velocity                  0.000       0.586       2.175
      Percentage direct scans             0%          0%          0%
      Zone normal velocity          5105.086    6824.681     671.158
      Zone dma32 velocity            112.473     794.858     745.896
      Zone dma velocity                0.000       0.000       0.000
      Page writes by reclaim     1929612.000 6861768.000   32821.000
      Page writes file               1245752     6861750       32803
      Page writes anon                683860          18          18
      Page reclaim immediate            7484          40         239
      Sector Reads                   1130320       93996       86900
      Sector Writes                 13508052    10823500    11804436
      Page rescued immediate               0           0           0
      Slabs scanned                    33536       27136       18560
      Direct inode steals                  0           0           0
      Kswapd inode steals               8641        1035           0
      Kswapd skipped wait                  0           0           0
      THP fault alloc                      8          37          33
      THP collapse alloc                 508         552         515
      THP splits                          24           1           1
      THP fault fallback                   0           0           0
      THP collapse fail                    0           0           0
      
      There are a number of observations to make here
      
      1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
         pages swapped were really unused anonymous pages. Related to that,
         major faults are much reduced.
      
      2. kswapd efficiency was impacted by the initial series but with these
         follow-up patches, the efficiency is now at 66% indicating that far
         fewer pages were skipped during scanning due to dirty or writeback
         pages.
      
      3. kswapd velocity is reduced indicating that fewer pages are being scanned
         with the follow-up series as kswapd now stalls when the tail of the
         LRU queue is full of unqueued dirty pages. The stall gives flushers a
         chance to catch up so kswapd can reclaim clean pages when it wakes.
      
      4. In light of Zlatko's recent reports about zone scanning imbalances,
         mmtests now reports scanning velocity on a per-zone basis. With mainline,
         you can see that the scanning activity is dominated by the Normal
         zone with over 45 times more scanning in Normal than the DMA32 zone.
         With the series currently in mmotm, the ratio is slightly better but it
         is still the case that the bulk of scanning is in the highest zone. With
         this follow-up series, the ratio of scanning between the Normal and
         DMA32 zone is roughly equal.
      
      5. As Dave Chinner observed, the current patches in mmotm increased the
         number of pages written from kswapd context which is expected to adversely
         impact IO performance. With the follow-up patches, far fewer pages are
         written from kswapd context than the mainline kernel.
      
      6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
         the follow-up series, there is less slab shrinking activity and no inodes
         were reclaimed.
      
      7. Note that "Sector Reads" is drastically reduced implying that the source
         data being used for the IO is not being aggressively discarded due to
         page reclaim skipping over dirty pages and reclaiming clean pages. Note
         that the reduction in reads could also be due to inode data not being
         re-read from disk after a slab shrink.
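
      The efficiency and zone-ratio figures quoted in the observations
      above can be reproduced from the table with a short sketch.  It
      assumes that efficiency is reclaimed pages as a percentage of scanned
      pages, truncated to an integer, and that a scan count of zero is
      reported as 100%; both assumptions are consistent with the numbers
      shown but are not stated in the text.  The function names are
      hypothetical.

```c
/*
 * Percentage of scanned pages that were reclaimed, truncated.
 * Assumption: zero pages scanned is reported as 100%, as the
 * "Direct efficiency" row of the vanilla column suggests.
 */
static unsigned int reclaim_efficiency_pct(unsigned long long reclaimed,
                                           unsigned long long scanned)
{
    if (scanned == 0)
        return 100;
    return (unsigned int)(reclaimed * 100 / scanned);
}

/* Ratio between two scanning velocities, e.g. Normal versus DMA32. */
static double velocity_ratio(double a, double b)
{
    return a / b;
}
```

      For the kswapd rows this gives 17%, 13% and 66% for the three
      kernels, and the Normal to DMA32 velocity ratio for the vanilla
      kernel comes out at a little over 45, matching observation 4.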
      
                             3.9.0               3.9.0                  3.9.0
                           vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
      Mean sda-avgqz        166.99       32.09       33.44
      Mean sda-await        853.64      192.76      185.43
      Mean sda-r_await        6.31        9.24        5.97
      Mean sda-w_await     2992.81      202.65      192.43
      Max  sda-avgqz       1409.91      718.75      698.98
      Max  sda-await       6665.74     3538.00     3124.23
      Max  sda-r_await       58.96      111.95       58.00
      Max  sda-w_await    28458.94     3977.29     3148.61
      
      In light of the changes in writes from reclaim context, the number of
      reads and Dave Chinner's concerns about IO performance, I took a closer
      look at the IO stats for the test disk. A few observations:
      
      1. The average queue size is reduced by the initial series and roughly
         the same with this follow up.
      
      2. Average wait times for writes are reduced and as the IO
         is completing faster it at least implies that the gain is because
         flushers are writing the files efficiently instead of page reclaim
         getting in the way.
      
      3. The reduction in maximum write latency is staggering. 28 seconds down
         to 3 seconds.
      
      Jan Kara asked how NFS is affected by all of this. Unstable pages can
      be taken into account as one of the patches in the series shows but it
      is still the case that filesystems with unusual handling of dirty or
      writeback could still be treated better.
      
      Tests like postmark, fsmark and largedd showed up nothing useful. On my test
      setup, pages are simply not being written back from reclaim context with or
      without the patches and there are no changes in performance. My test setup
      probably is just not strong enough network-wise to be really interesting.
      
      I ran a longer-lived memcached test with IO going to NFS instead of a local disk
      
      parallelio
                                                   3.9.0                       3.9.0                       3.9.0
                                                 vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
      Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
      Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
      Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
      Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
      Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
      Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
      Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
      Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
      Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
      Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
      Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
      Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
      Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
      Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
      Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
      Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
      Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
      
      1. Performance does not collapse due to IO which is good. IO is also completing
         faster. Note with mmotm, IO completes in a third of the time and faster again
         with this series applied
      
      2. Swapping is reduced, although not eliminated. The figures for the follow-up
         look bad but it does vary a bit as the stalling is not perfect for nfs
         or filesystems like ext3 with unusual handling of dirty and writeback
         pages
      
      3. There are swapins, particularly with larger amounts of IO indicating
         that active pages are being reclaimed. However, the number is much
         reduced.
      
                                       3.9.0               3.9.0                  3.9.0
                                     vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
      Minor Faults                  36339175    35025445    35219699
      Major Faults                    310964       27108       51887
      Swap Ins                       2176399      173069      333316
      Swap Outs                      3344050      357228      504824
      Direct pages scanned              8972       77283       43242
      Kswapd pages scanned          20899983     8939566    14772851
      Kswapd pages reclaimed         6193156     5172605     5231026
      Direct pages reclaimed            8450       73802       39514
      Kswapd efficiency                  29%         57%         35%
      Kswapd velocity               3929.743    1847.499    3058.840
      Direct efficiency                  94%         95%         91%
      Direct velocity                  1.687      15.972       8.954
      Percentage direct scans             0%          0%          0%
      Zone normal velocity          3721.907     939.103    2185.142
      Zone dma32 velocity            209.522     924.368     882.651
      Zone dma velocity                0.000       0.000       0.000
      Page writes by reclaim     4082185.000  526319.000  537114.000
      Page writes file                738135      169091       32290
      Page writes anon               3344050      357228      504824
      Page reclaim immediate            9524         170     5595843
      Sector Reads                   8909900      861192     1483680
      Sector Writes                 13428980     1488744     2076800
      Page rescued immediate               0           0           0
      Slabs scanned                    38016       31744       28672
      Direct inode steals                  0           0           0
      Kswapd inode steals                424           0           0
      Kswapd skipped wait                  0           0           0
      THP fault alloc                     14          15         119
      THP collapse alloc                1767        1569        1618
      THP splits                          30          29          25
      THP fault fallback                   0           0           0
      THP collapse fail                    8           5           0
      Compaction stalls                   17          41         100
      Compaction success                   7          31          95
      Compaction failures                 10          10           5
      Page migrate success              7083       22157       62217
      Page migrate failure                 0           0           0
      Compaction pages isolated        14847       48758      135830
      Compaction migrate scanned       18328       48398      138929
      Compaction free scanned        2000255      355827     1720269
      Compaction cost                      7          24          68
      
      I guess the main takeaway again is the much reduced page writes
      from reclaim context and reduced reads.
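
The efficiency rows in the table above appear to be pages reclaimed divided by pages scanned, truncated to a whole percent. A minimal Python sketch, assuming that definition (the function name is illustrative and not part of any benchmark tool), reproduces the kswapd figures and checks that "Page writes by reclaim" is the sum of the file and anon rows:

```python
def efficiency_pct(reclaimed: int, scanned: int) -> int:
    """Reclaimed pages as a truncated percentage of scanned pages.

    Assumption: this is how the 'efficiency' rows are derived.
    """
    if scanned == 0:
        return 0  # arbitrary choice for an empty scan; not taken from the source
    return reclaimed * 100 // scanned


# (kswapd pages scanned, kswapd pages reclaimed), copied from the table.
KSWAPD = {
    "vanilla": (20899983, 6193156),
    "mm1-mmotm-20130522": (8939566, 5172605),
    "mm1-lessdisrupt-v7r10": (14772851, 5231026),
}

# (page writes file, page writes anon, page writes by reclaim), same table.
WRITES = {
    "vanilla": (738135, 3344050, 4082185),
    "mm1-mmotm-20130522": (169091, 357228, 526319),
    "mm1-lessdisrupt-v7r10": (32290, 504824, 537114),
}

for kernel, (scanned, reclaimed) in KSWAPD.items():
    file_w, anon_w, total_w = WRITES[kernel]
    print(f"{kernel:24s} efficiency {efficiency_pct(reclaimed, scanned)}%"
          f"  writes consistent: {file_w + anon_w == total_w}")
```

Read this way, an efficiency of 29% on the vanilla kernel means kswapd scanned more than three pages for every page it reclaimed.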
      
                             3.9.0       3.9.0       3.9.0
                           vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
      Mean sda-avgqz         23.58        0.35        0.44
      Mean sda-await        133.47       15.72       15.46
      Mean sda-r_await        4.72        4.69        3.95
      Mean sda-w_await      507.69       28.40       33.68
      Max  sda-avgqz        680.60       12.25       23.14
      Max  sda-await       3958.89      221.83      286.22
      Max  sda-r_await       63.86       61.23       67.29
      Max  sda-w_await    11710.38      883.57     1767.28
      
      And as before, write wait times are much reduced.
      
      This patch:
      
      The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
      encountered, not priority" decides whether to writeback pages from reclaim
      context based on the number of dirty pages encountered.  This situation is
      flagged too easily and flushers are not given the chance to catch up
      resulting in more pages being written from reclaim context and potentially
      impacting IO performance.  The check for PageWriteback is also misplaced
      as it happens within a PageDirty check, which is nonsense as the dirty flag
      may have been cleared for IO.  The accounting is updated very late and pages
      that are already under writeback, were reactivated, could not be unmapped or
      could not be released are all missed.  Similarly, a page is considered
      congested for reasons other than being congested and pages that cannot be
      written out in the correct context are skipped.  Finally, it considers
      stalling and writing back filesystem pages due to encountering dirty
      anonymous pages at the tail of the LRU which is dumb.
      
      This patch causes kswapd to begin writing filesystem pages from reclaim
      context only if page reclaim found that all filesystem pages at the tail
      of the LRU were unqueued dirty pages.  Before it starts writing filesystem
      pages, it will stall to give flushers a chance to catch up.  The decision
      on whether to call wait_iff_congested() is also now determined by dirty
      filesystem pages only.  Congested pages are based on whether the underlying
      BDI is congested regardless of the context of the reclaiming process.
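
The rule described above can be modelled in user space. The following is a minimal Python sketch, not the kernel code: the `Page` fields and both function names are hypothetical, and a batch here simply stands for the pages that reclaim examined at the tail of the LRU.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Page:
    """Toy stand-in for a page on the LRU (field names are hypothetical)."""
    file_backed: bool            # False means anonymous memory
    dirty: bool
    queued_for_writeback: bool = False


def kswapd_should_write_file_pages(batch: Iterable[Page]) -> bool:
    """True only if every file page in the batch is dirty and not yet queued.

    Dirty anonymous pages are ignored, so they can never trigger
    writeback of filesystem pages on their own.
    """
    file_pages = [p for p in batch if p.file_backed]
    if not file_pages:
        return False
    return all(p.dirty and not p.queued_for_writeback for p in file_pages)


def reclaim_actions(batch: Iterable[Page]) -> list:
    """Order of operations from the description: stall first, then write."""
    if kswapd_should_write_file_pages(list(batch)):
        return ["stall_for_flushers", "start_writing_file_pages"]
    return []
```

A single clean or already-queued file page in the batch is enough to keep kswapd from writing, which leaves the work to the flusher threads.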
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: vmscan: move logic from balance_pgdat() to kswapd_shrink_zone() · 7c954f6d
      Committed by Mel Gorman
      balance_pgdat() is very long and some of the logic can and should be
      internal to kswapd_shrink_zone().  Move it so the flow of
      balance_pgdat() is marginally easier to follow.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: vmscan: check if kswapd should writepage once per pgdat scan · b7ea3c41
      Committed by Mel Gorman
      Currently kswapd checks if it should start writepage as it shrinks each
      zone without taking into consideration if the zone is balanced or not.
      This is not wrong as such but it does not make much sense either.  This
      patch checks once per pgdat scan if kswapd should be writing pages.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: vmscan: block kswapd if it is encountering pages under writeback · 283aba9f
      Committed by Mel Gorman
      Historically, kswapd used to congestion_wait() at higher priorities if
      it was not making forward progress.  This made no sense as the failure
      to make progress could be completely independent of IO.  It was later
      replaced by wait_iff_congested() and removed entirely by commit 258401a6
      (mm: don't wait on congested zones in balance_pgdat()) as it was
      duplicating logic in shrink_inactive_list().
      
      This is problematic.  If kswapd encounters many pages under writeback
      and it continues to scan until it reaches the high watermark then it
      will quickly skip over the pages under writeback and reclaim clean young
      pages or push applications out to swap.
      
      The use of wait_iff_congested() is not suited to kswapd as it will only
      stall if the underlying BDI is really congested or a direct reclaimer
      was unable to write to the underlying BDI.  kswapd bypasses the BDI
      congestion as it sets PF_SWAPWRITE but even if this was taken into
      account then it would cause direct reclaimers to stall on writeback
      which is not desirable.
      
      This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
      encountering too many pages under writeback.  If this flag is set and
      kswapd encounters a PageReclaim page under writeback then it'll assume
      that the LRU lists are being recycled too quickly before IO can complete
      and block waiting for some IO to complete.
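
The two halves of this change (flagging the zone, then deciding whether kswapd blocks) can be illustrated with a small user-space model. This is a hedged Python sketch rather than the kernel implementation: `Zone.writeback` stands in for the ZONE_WRITEBACK flag, and `threshold` is a hypothetical parameter because the description does not say how many pages count as "too many".

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Toy zone; `writeback` stands in for the ZONE_WRITEBACK flag."""
    writeback: bool = False


def note_writeback_seen(zone: Zone, nr_writeback: int, threshold: int) -> None:
    """Flag the zone when a scan met 'too many' pages under writeback.

    `threshold` is a hypothetical parameter; the description does not
    say how many pages count as too many.
    """
    if nr_writeback > 0 and nr_writeback >= threshold:
        zone.writeback = True


def kswapd_should_block(zone: Zone, is_kswapd: bool,
                        under_writeback: bool, page_reclaim: bool) -> bool:
    """kswapd blocks only on a PageReclaim page under writeback in a flagged zone."""
    return is_kswapd and under_writeback and page_reclaim and zone.writeback
```

In this model a direct reclaimer never blocks, which matches the stated goal of not stalling direct reclaimers on writeback.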
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>