1. 26 Feb 2013: 1 commit
  2. 25 Feb 2013: 13 commits
  3. 24 Feb 2013: 26 commits
    • mm,ksm: FOLL_MIGRATION do migration_entry_wait · 5117b3b8
      Committed by Hugh Dickins
      In "ksm: remove old stable nodes more thoroughly" I said that I'd never
      seen its WARN_ON_ONCE(page_mapped(page)).  True at the time of writing,
      but it soon appeared once I tried fuller tests on the whole series.
      
      It turned out to be due to the KSM page migration itself: unmerge_and_
      remove_all_rmap_items() failed to locate and replace all the KSM pages,
      because of that hiatus in page migration when old pte has been replaced
      by migration entry, but not yet by new pte.  follow_page() finds no page
      at that instant, but a KSM page reappears shortly after, without a
      fault.
      
      Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
      for KSM's break_cow().  I'd have preferred to avoid another flag, and do
      it every time, in case someone else makes the same easy mistake; but did
      not find another transgressor (the common get_user_pages() is of course
      safe), and cannot be sure that every follow_page() caller is prepared to
      sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
      already sleep there, since anon_vma locking was changed to mutex, but
      maybe that's somehow excluded.
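      As a hedged sketch of the idea (not the exact hunk; the labels and locals
      are illustrative), follow_page() can now wait on a migration entry instead
      of reporting "no page" when the caller asks for it:

      	/* inside follow_page(), with the pte mapped via ptep/ptl */
      	if (!pte_present(pte)) {
      		swp_entry_t entry;

      		if (likely(!(flags & FOLL_MIGRATION)))
      			goto no_page;		/* old behaviour */
      		entry = pte_to_swp_entry(pte);
      		if (!is_migration_entry(entry))
      			goto no_page;
      		pte_unmap_unlock(ptep, ptl);
      		migration_entry_wait(mm, pmd, address);
      		goto retry;			/* look the page up again */
      	}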
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5117b3b8
    • mm: accelerate mm_populate() treatment of THP pages · 240aadee
      Committed by Michel Lespinasse
      This change adds a follow_page_mask function which is equivalent to
      follow_page, but with an extra page_mask argument.
      
      follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
      a THP page, and to 0 in other cases.
      
      __get_user_pages() makes use of this in order to accelerate populating
      THP ranges - that is, when both the pages and vmas arrays are NULL, we
      don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
      we also avoid taking mm->page_table_lock that many times).
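      A hedged sketch of the fast path in __get_user_pages() (an excerpt, with
      illustrative locals): when neither pages nor vmas need to be filled in,
      one THP advances the loop by all of its base pages at once.

      	unsigned long page_mask;
      	long page_increm;
      	struct page *page;

      	page = follow_page_mask(vma, start, foll_flags, &page_mask);
      	if (pages || vmas)
      		page_mask = 0;	/* still record one base page at a time */
      	/* base pages covered by this page from 'start' to its end */
      	page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
      	i += page_increm;
      	start += page_increm * PAGE_SIZE;
      	nr_pages -= page_increm;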
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      240aadee
    • mm: use long type for page counts in mm_populate() and get_user_pages() · 28a35716
      Committed by Michel Lespinasse
      Use long type for page counts in mm_populate() so as to avoid integer
      overflow when running the following test code:
      
      int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
      }
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28a35716
    • vmscan: change type of vm_total_pages to unsigned long · b21e0b90
      Committed by Zhang Yanfei
      This variable is calculated from nr_free_pagecache_pages so
      change its type to unsigned long.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b21e0b90
    • mm: fix return type for functions nr_free_*_pages · ebec3862
      Committed by Zhang Yanfei
      Currently, the amount of RAM that functions nr_free_*_pages return is
      held in unsigned int.  But in machines with big memory (exceeding 16TB),
      the amount may be incorrect because of overflow, so fix it.
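      As a hedged back-of-the-envelope check: 16 TiB of 4 KiB pages is
      2^44 / 2^12 = 2^32 pages, which already exceeds UINT_MAX (2^32 - 1), so
      the counts have to widen to unsigned long on such machines.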
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebec3862
    • mmzone: add pgdat_{end_pfn,is_empty}() helpers & consolidate. · da3649e1
      Committed by Cody P Schafer
      Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar
      zone_*() functions.
      
      Change node_end_pfn() to be a wrapper of pgdat_end_pfn().
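      A hedged sketch of what such helpers look like, modeled on the zone_*()
      counterparts:

      	static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
      	{
      		return pgdat->node_start_pfn + pgdat->node_spanned_pages;
      	}

      	static inline bool pgdat_is_empty(pg_data_t *pgdat)
      	{
      		return !pgdat->node_spanned_pages;
      	}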
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da3649e1
    • mm: add zone_is_empty() and zone_is_initialized() · 2a6e3ebe
      Committed by Cody P Schafer
      Factoring out these 2 checks makes it more clear what we are actually
      checking for.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a6e3ebe
    • mm: add & use zone_end_pfn() and zone_spans_pfn() · 108bcc96
      Committed by Cody P Schafer
      Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
      duplication.
      
      This also switches to using them in compaction (where an additional
      variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
      kmemleak.
      
      Note that in compaction.c I avoid calling zone_end_pfn() repeatedly,
      because I expect that at some point the synchronization issues with
      start_pfn & spanned_pages will need fixing, either by actually using the
      seqlock or by clever memory barrier usage.
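      A hedged sketch of the two helpers (the zone fields are as in struct zone
      at the time):

      	static inline unsigned long zone_end_pfn(const struct zone *zone)
      	{
      		return zone->zone_start_pfn + zone->spanned_pages;
      	}

      	static inline bool zone_spans_pfn(const struct zone *zone,
      					  unsigned long pfn)
      	{
      		return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
      	}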
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      108bcc96
    • mm: add SECTION_IN_PAGE_FLAGS · 9127ab4f
      Committed by Cody P Schafer
      Instead of directly utilizing a combination of config options to determine
      this, add a macro to specifically address it.
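      A hedged sketch of the idea: name the config combination once instead of
      repeating it at every use site (the exact condition shown here is an
      assumption):

      	#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
      	#define SECTION_IN_PAGE_FLAGS
      	#endif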
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9127ab4f
    • mm: refactor inactive_file_is_low() to use get_lru_size() · e3790144
      Committed by Johannes Weiner
      An inactive file list is considered low when its active counterpart is
      bigger, regardless of whether it is a global zone LRU list or a memcg
      zone LRU list.  The only difference is in how the LRU size is assessed.
      
      get_lru_size() does the right thing for both global and memcg reclaim
      situations.
      
      Get rid of inactive_file_is_low_global() and
      mem_cgroup_inactive_file_is_low() by using get_lru_size() and comparing
      the numbers in common code.
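      A hedged sketch of the resulting common comparison:

      	static int inactive_file_is_low(struct lruvec *lruvec)
      	{
      		unsigned long inactive, active;

      		inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
      		active   = get_lru_size(lruvec, LRU_ACTIVE_FILE);

      		return active > inactive;
      	}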
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3790144
    • mm: remove offlining arg to migrate_pages · 9c620e2b
      Committed by Hugh Dickins
      No functional change, but the only purpose of the offlining argument to
      migrate_pages() etc. was to ensure that __unmap_and_move() could migrate a
      KSM page for memory hotremove (which took ksm_thread_mutex) but not for
      other callers.  Now all cases are safe, so remove the arg.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c620e2b
    • ksm: remove old stable nodes more thoroughly · cbf86cfe
      Committed by Hugh Dickins
      Switching merge_across_nodes after running KSM is liable to oops on stale
      nodes still left over from the previous stable tree.  It's not something
      that people will often want to do, but it would be lame to demand a reboot
      when they're trying to determine which merge_across_nodes setting is best.
      
      How can this happen?  We only permit switching merge_across_nodes when
      pages_shared is 0, and usually set run 2 to force that beforehand, which
      ought to unmerge everything: yet oopses still occur when you then run 1.
      
      Three causes:
      
      1. The old stable tree (built according to the inverse
         merge_across_nodes) has not been fully torn down.  A stable node
         lingers until get_ksm_page() notices that the page it references no
         longer references it: but the page is not necessarily freed as soon as
         expected, particularly when swapcache.
      
         Fix this with a pass through the old stable tree, applying
         get_ksm_page() to each of the remaining nodes (most found stale and
         removed immediately), with forced removal of any left over.  Unless the
         page is still mapped: I've not seen that case, it shouldn't occur, but
         better to WARN_ON_ONCE and EBUSY than BUG.
      
      2. __ksm_enter() has a nice little optimization, to insert the new mm
         just behind ksmd's cursor, so there's a full pass for it to stabilize
         (or be removed) before ksmd addresses it.  Nice when ksmd is running,
         but not so nice when we're trying to unmerge all mms: we were missing
         those mms forked and inserted behind the unmerge cursor.  Easily fixed
         by inserting at the end when KSM_RUN_UNMERGE.
      
      3.  It is possible for a KSM page to be faulted back from swapcache
         into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
         it.  Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
         private to ksm.c, so dissolve the distinction between
         ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
         the one call into ksm.c.
      
      A long outstanding, unrelated bugfix sneaks in with that third fix:
      ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
      error when read in from swap) to a page which it then marks Uptodate.  Fix
      this case by not copying, letting do_swap_page() discover the error.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbf86cfe
    • mm: rename page struct field helpers · 22b751c3
      Committed by Mel Gorman
      The function names page_xchg_last_nid(), page_last_nid() and
      reset_page_last_nid() were judged to be inconsistent so rename them to a
      struct_field_op style pattern.  As it looked jarring to have
      reset_page_mapcount() and page_nid_reset_last() beside each other in
      memmap_init_zone(), this patch also renames reset_page_mapcount() to
      page_mapcount_reset().  There are others like init_page_count() but as
      it is used throughout the arch code a rename would likely cause more
      conflicts than it is worth.
      
      [akpm@linux-foundation.org: fix zcache]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22b751c3
    • mm: uninline page_xchg_last_nid() · 4468b8f1
      Committed by Mel Gorman
      Andrew Morton pointed out that page_xchg_last_nid() and
      reset_page_last_nid() were "getting nuttily large" and asked that it be
      investigated.
      
      reset_page_last_nid() is on the page free path and it would be
      unfortunate to make that path more expensive than it needs to be.  Due
      to the internal use of page_xchg_last_nid() it is already too expensive
      but fortunately, it should also be impossible for the page->flags to be
      updated in parallel when we call reset_page_last_nid().  Instead of
      uninlining the function, it now uses a simpler implementation that assumes
      no parallel updates and should be sufficiently short for inlining.
      
      page_xchg_last_nid() is called in paths that are already quite expensive
      (splitting huge page, fault handling, migration) and it is reasonable to
      uninline.  There was not really a good place to place the function but
      mm/mmzone.c was the closest fit IMO.
      
      This patch saved 128 bytes of text in the vmlinux file for the kernel
      configuration I used for testing automatic NUMA balancing.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4468b8f1
    • page-writeback.c: subtract min_free_kbytes from dirtyable memory · 75f7ad8e
      Committed by Paul Szabo
      When calculating the amount of dirtyable memory, min_free_kbytes should be
      subtracted because it is not intended for dirty pages.
      
      Addresses http://bugs.debian.org/695182
      
      [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
      [akpm@linux-foundation.org: fix min() warning]
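      A hedged sketch of the adjustment (the names are illustrative):
      min_free_kbytes is in KiB, so it is converted to pages before being
      subtracted from the dirtyable total, clamped with min() so the result
      cannot underflow.

      	unsigned long reserved = min_free_kbytes >> (PAGE_SHIFT - 10);

      	dirtyable -= min(dirtyable, reserved);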
      Signed-off-by: Paul Szabo <psz@maths.usyd.edu.au>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75f7ad8e
    • mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write() · 08b52706
      Committed by Konstantin Khlebnikov
      The comment in commit 4fc3f1d6 ("mm/rmap, migration: Make
      rmap_walk_anon() and try_to_unmap_anon() more scalable") says:
      
      | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
      | to make it clearer that it's an exclusive write-lock in
      | that case - suggested by Rik van Riel.
      
      But that commit renamed only anon_vma_lock(); rename anon_vma_unlock() to match.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08b52706
    • swap: add per-partition lock for swapfile · ec8acf20
      Committed by Shaohua Li
      swap_lock is heavily contended when I test swap to 3 fast SSDs (it is even
      slightly slower than swapping to 2 such SSDs).  The main contention comes
      from swap_info_get().  This patch tries to close the gap by adding a new
      per-partition lock.
      
      Global data like nr_swapfiles, total_swap_pages, least_priority and
      swap_list are still protected by swap_lock.
      
      nr_swap_pages is atomic now, so it can be changed without swap_lock.  In
      theory it's possible that get_swap_page() finds no swap pages even though
      free swap pages exist, but that doesn't sound like a big problem.
      
      Accessing partition specific data (like scan_swap_map and so on) is only
      protected by swap_info_struct.lock.
      
      Changing swap_info_struct.flags requires holding both swap_lock and
      swap_info_struct.lock, because scan_swap_map() checks it.  Reading the
      flags is OK with either lock held.
      
      If both swap_lock and swap_info_struct.lock must be held, we always take
      the former first to avoid deadlock.
      
      swap_entry_free() can change swap_list.  To delete that code, we add a
      new highest_priority_index.  Whenever get_swap_page() is called, we
      check it.  If it's valid, we use it.
      
      It's a pity get_swap_page() still holds swap_lock.  But in practice,
      swap_lock isn't heavily contended in my test with this patch (or, put
      another way, there are other much heavier bottlenecks like TLB flush).
      And, by the way, it looks like get_swap_page() doesn't really need the
      lock: we never free swap_info[] and we check the SWAP_WRITEOK flag.  The
      only risk without the lock is that we could swap out to some low-priority
      swap, but we can quickly recover after several rounds of swap, so that
      doesn't sound like a big deal to me.  But I'd prefer to fix this if it's a
      real problem.
      
      "swap: make each swap partition have one address_space" improved the
      swapout speed from 1.7G/s to 2G/s.  This patch further improves the
      speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
      so TLB flush isn't the biggest bottleneck before the patches.
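      A hedged sketch of the documented lock ordering (si being a struct
      swap_info_struct pointer):

      	spin_lock(&swap_lock);		/* global lock first ... */
      	spin_lock(&si->lock);		/* ... then the per-partition lock */
      	si->flags |= SWP_WRITEOK;	/* flag changes need both locks */
      	spin_unlock(&si->lock);
      	spin_unlock(&swap_lock);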
      
      [arnd@arndb.de: fix it for nommu]
      [hughd@google.com: add missing unlock]
      [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec8acf20
    • swap: make each swap partition have one address_space · 33806f06
      Committed by Shaohua Li
      When I use several fast SSDs to do swap, swapper_space.tree_lock is
      heavily contended.  This patch gives each swap partition its own
      address_space to reduce the lock contention.  There is an array of
      address_space for swap; the swap entry type is the index into the array.

      In my test with 3 SSDs, this increases the swapout throughput by 20%.
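      A hedged sketch of the per-type lookup (swapper_spaces as the assumed name
      of the new array):

      	static inline struct address_space *swap_address_space(swp_entry_t entry)
      	{
      		return &swapper_spaces[swp_type(entry)];
      	}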
      
      [akpm@linux-foundation.org: revert unneeded change to  __add_to_swap_cache]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33806f06
    • mm: don't inline page_mapping() · 9800339b
      Committed by Shaohua Li
      According to akpm, this saves 1/2k text and makes things simple for the
      next patch.
      
      Numbers from Minchan:
      
      add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
      function                                     old     new   delta
      page_mapping                                   -      48     +48
      do_task_stat                                2292    2308     +16
      page_remove_rmap                             240     248      +8
      load_elf_binary                             4500    4508      +8
      update_queue                                 532     536      +4
      scsi_probe_and_add_lun                      2892    2896      +4
      lookup_fast                                  644     648      +4
      vcs_read                                    1040    1036      -4
      __ip_route_output_key                       1904    1900      -4
      ip_route_input_noref                        2508    2500      -8
      shmem_file_aio_read                          784     772     -12
      __isolate_lru_page                           272     256     -16
      shmem_replace_page                           708     688     -20
      mark_buffer_dirty                            228     208     -20
      __set_page_dirty_buffers                     240     220     -20
      __remove_mapping                             276     256     -20
      update_mmu_cache                             500     476     -24
      set_page_dirty_balance                        92      68     -24
      set_page_dirty                               172     148     -24
      page_evictable                                88      64     -24
      page_cache_pipe_buf_steal                    248     224     -24
      clear_page_dirty_for_io                      340     316     -24
      test_set_page_writeback                      400     372     -28
      test_clear_page_writeback                    516     488     -28
      invalidate_inode_page                        156     128     -28
      page_mkclean                                 432     400     -32
      flush_dcache_page                            360     328     -32
      __set_page_dirty_nobuffers                   324     280     -44
      shrink_page_list                            2412    2356     -56
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9800339b
    • mm: fold page->_last_nid into page->flags where possible · 75980e97
      Committed by Peter Zijlstra
      page->_last_nid fits into page->flags on 64-bit.  The unlikely 32-bit
      NUMA configuration with NUMA Balancing will still need an extra page
      field.  As Peter notes "Completely dropping 32bit support for
      CONFIG_NUMA_BALANCING would simplify things, but it would also remove
      the warning if we grow enough 64bit only page-flags to push the last-cpu
      out."
      
      [mgorman@suse.de: minor modifications]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75980e97
    • mm: move page flags layout to separate header · bbeae5b0
      Committed by Peter Zijlstra
      This is a preparation patch for moving page->_last_nid into page->flags
      that moves page flag layout information to a separate header.  This
      patch is necessary because otherwise there would be a circular
      dependency between mm_types.h and mm.h.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbeae5b0
    • mm: numa: handle side-effects in count_vm_numa_events() for !CONFIG_NUMA_BALANCING · 3c0ff468
      Committed by Mel Gorman
      The current definition of count_vm_numa_events() is wrong for
      !CONFIG_NUMA_BALANCING, as the following would miss the side effect.
      
      	count_vm_numa_events(NUMA_FOO, bar++);
      
      There are no such users of count_vm_numa_events() but this patch fixes
      it as it is a potential pitfall.  Ideally both would be converted to
      static inline but NUMA_PTE_UPDATES is not defined if
      !CONFIG_NUMA_BALANCING and creating dummy constants just to have a
      static inline would be similarly clumsy.
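      A hedged illustration of the pitfall: the !CONFIG_NUMA_BALANCING stub must
      still evaluate its second argument so that side effects like bar++ survive.

      	#ifdef CONFIG_NUMA_BALANCING
      	#define count_vm_numa_events(x, y) count_vm_events(x, y)
      	#else
      	/* an empty "do {} while (0)" stub would silently drop bar++ above;
      	 * evaluating y keeps the side effect without counting anything */
      	#define count_vm_numa_events(x, y) do { (void)(y); } while (0)
      	#endif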
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c0ff468
    • mm: numa: fix minor typo in numa_next_scan · 34f0315a
      Committed by Mel Gorman
      s/me/be/ and clarify the comment a bit when we're changing it anyway.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Suggested-by: Simon Jeons <simon.jeons@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34f0315a
    • pm / runtime: introduce pm_runtime_set_memalloc_noio() · e823407f
      Committed by Ming Lei
      Introduce the flag memalloc_noio in 'struct dev_pm_info' to help the PM
      core teach mm not to allocate memory with the GFP_KERNEL flag, in order to
      avoid a probable deadlock.

      As explained in the comment, any GFP_KERNEL allocation inside
      runtime_resume() or runtime_suspend() of any device on the path from a
      block or network device to the root device in the device tree may cause a
      deadlock.  The introduced pm_runtime_set_memalloc_noio() sets or clears
      the flag on each device in the path recursively.
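      A hedged usage sketch (dev is a struct device; placement at add/remove
      time is illustrative):

      	/* when a block or network device is added to the device tree */
      	pm_runtime_set_memalloc_noio(dev, true);

      	/* ... and when it is removed again */
      	pm_runtime_set_memalloc_noio(dev, false);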
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e823407f
    • mm: teach mm by current context info to not do I/O during memory allocation · 21caf2fc
      Committed by Ming Lei
      This patch introduces the PF_MEMALLOC_NOIO process flag (in the 'flags'
      field of 'struct task_struct'), so that the flag can be set by a task to
      avoid doing I/O inside memory allocation in the task's context.
      
      The patch tries to solve a deadlock problem caused by block devices; the
      problem may happen at least in the following situations:

      - during block device runtime resume, if memory allocation with
        GFP_KERNEL is called inside the runtime resume callback of any one of its
        ancestors (or of the block device itself), the deadlock may be triggered
        inside the memory allocation since it might not complete until the block
        device becomes active and the involved page I/O finishes.  The situation
        was first pointed out by Alan Stern.  It is not a good approach to
        convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
        subsystems may be involved (for example, PCI, USB and SCSI may be
        involved for a USB mass storage device, and network devices are involved
        too in the iSCSI case)
      
      - during block device runtime suspend, because runtime resume needs to
        wait for the completion of a concurrent runtime suspend.
      
      - during error handling of a USB mass storage device, a USB bus reset will
        be issued to the device, so there shouldn't be any memory allocation with
        GFP_KERNEL during the USB bus reset, otherwise a deadlock similar to the
        above may be triggered.  Unfortunately, any USB device may include a
        mass storage interface in theory, so this requires all USB interface
        drivers to handle the situation.  In fact, most USB drivers don't know
        how to handle a bus reset on the device and don't provide .pre_reset()
        and .post_reset() callbacks at all, so the USB core has to unbind and
        rebind the driver for these devices.  So it is still not practical to
        resort to GFP_NOIO for solving the problem.
      
      The introduced solution can also be used by the block subsystem or block
      drivers, for example by setting the PF_MEMALLOC_NOIO flag before doing an
      actual I/O transfer.

      It is not a good idea to convert all the GFP_KERNEL allocations in the
      affected path into GFP_NOIO, because the functions doing those allocations
      may be implemented as library code and called in many other contexts.
      
      In fact, memalloc_noio_flags() makes it possible to convert some of the
      currently static GFP_NOIO allocations back into GFP_KERNEL in unaffected
      contexts: at least almost all GFP_NOIO in the USB subsystem could be
      converted into GFP_KERNEL after applying this approach, so that GFP_NOIO
      allocations generally happen only in runtime resume, bus reset and block
      I/O transfer contexts.
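      A hedged sketch of what such a conversion helper can look like (the exact
      mask bits are an assumption):

      	static inline gfp_t memalloc_noio_flags(gfp_t flags)
      	{
      		if (unlikely(current->flags & PF_MEMALLOC_NOIO))
      			flags &= ~(__GFP_IO | __GFP_FS);
      		return flags;
      	}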
      
      [1], several GFP_KERNEL allocation examples in runtime resume path
      
      - pci subsystem
      acpi_os_allocate
      	<-acpi_ut_allocate
      		<-ACPI_ALLOCATE_ZEROED
      			<-acpi_evaluate_object
      				<-__acpi_bus_set_power
      					<-acpi_bus_set_power
      						<-acpi_pci_set_power_state
      							<-platform_pci_set_power_state
      								<-pci_platform_power_transition
      									<-__pci_complete_power_transition
      										<-pci_set_power_state
      											<-pci_restore_standard_config
      												<-pci_pm_runtime_resume
      - usb subsystem
      usb_get_status
      	<-finish_port_resume
      		<-usb_port_resume
      			<-generic_resume
      				<-usb_resume_device
      					<-usb_resume_both
      						<-usb_runtime_resume
      
      - some individual usb drivers
      usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
      
      Those are just the ones I have found.  Unfortunately, such allocations can
      only be found by a human being for now, and there are probably many still
      unfound, since any function in the resume path (call tree) may allocate
      memory with GFP_KERNEL.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21caf2fc