1. 30 May, 2012 35 commits
    • B
      swiotlb: print physical addresses consistently with other parts of kernel · 3af684c7
      Bjorn Helgaas committed
      Print swiotlb info in a style consistent with the %pR style used elsewhere
      in the kernel.  For example:
      
          -Placing 64MB software IO TLB between ffff88007a662000 - ffff88007e662000
          -software IO TLB at phys 0x7a662000 - 0x7e662000
          +software IO TLB [mem 0x7a662000-0x7e661fff] (64MB) mapped at [ffff88007a662000-ffff88007e661fff]
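
      The new format uses an inclusive end address, so the end is start + size - 1.
      A minimal userspace sketch of that arithmetic follows. The helper name
      format_mem_range() is hypothetical and is not a kernel API; in the kernel,
      the %pR specifier produces this style for a struct resource.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel API): format a physical range in the
 * "[mem 0x...-0x...]" style, using an inclusive end address.
 */
static int format_mem_range(char *buf, size_t len,
                            unsigned long long start,
                            unsigned long long size)
{
        unsigned long long end;

        assert(size > 0);          /* an empty range has no inclusive end */
        end = start + size - 1;    /* last byte that belongs to the range */

        return snprintf(buf, len, "[mem 0x%08llx-0x%08llx]", start, end);
}
```

      Called with a start of 0x7a662000 and a size of 64MB, this produces the
      [mem 0x7a662000-0x7e661fff] range shown in the example above.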
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3af684c7
    • B
      x86: print physical addresses consistently with other parts of kernel · 365811d6
      Bjorn Helgaas committed
      Print physical address info in a style consistent with the %pR style used
      elsewhere in the kernel.  For example:
      
          -found SMP MP-table at [ffff8800000fce90] fce90
          +found SMP MP-table at [mem 0x000fce90-0x000fce9f] mapped at [ffff8800000fce90]
          -initial memory mapped : 0 - 20000000
          +initial memory mapped: [mem 0x00000000-0x1fffffff]
          -Base memory trampoline at [ffff88000009c000] 9c000 size 8192
          +Base memory trampoline [mem 0x0009c000-0x0009dfff] mapped at [ffff88000009c000]
          -SRAT: Node 0 PXM 0 0-80000000
          +SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      365811d6
    • B
      x86: print e820 physical addresses consistently with other parts of kernel · 91eb0f67
      Bjorn Helgaas committed
      Print physical address info in a style consistent with the %pR style used
      elsewhere in the kernel.  For example:
      
          -BIOS-provided physical RAM map:
          +e820: BIOS-provided physical RAM map:
          - BIOS-e820: 0000000000000100 - 000000000009e000 (usable)
          +BIOS-e820: [mem 0x0000000000000100-0x000000000009dfff] usable
          -Allocating PCI resources starting at 90000000 (gap: 90000000:6ed1c000)
          +e820: [mem 0x90000000-0xfed1bfff] available for PCI devices
          -reserve RAM buffer: 000000000009e000 - 000000000009ffff
          +e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91eb0f67
    • K
      bug: completely remove code generated by disabled VM_BUG_ON() · 02602a18
      Konstantin Khlebnikov committed
      Even if CONFIG_DEBUG_VM=n, gcc generates code for some VM_BUG_ON() calls.

      For example, VM_BUG_ON(!PageCompound(page) || !PageHead(page)); in
      do_huge_pmd_wp_page() generates 114 bytes of code.

      But it mostly disappears when I split this VM_BUG_ON into two:
      
        -VM_BUG_ON(!PageCompound(page) || !PageHead(page));
        +VM_BUG_ON(!PageCompound(page));
        +VM_BUG_ON(!PageHead(page));
      
      Weird... but anyway, after this patch the code disappears completely.
      
        add/remove: 0/0 grow/shrink: 7/97 up/down: 135/-1784 (-1649)
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02602a18
    • K
      bug: introduce BUILD_BUG_ON_INVALID() macro · baf05aa9
      Konstantin Khlebnikov committed
      Sometimes we want to check an expression's correctness at compile time.
      "(void)(e);" or "if (e);" can be dangerous if the expression has
      side-effects, and gcc sometimes generates a lot of code, even if the
      expression has no effect.

      This patch introduces the macro BUILD_BUG_ON_INVALID() for such checks.
      It forces a compilation error if the expression is invalid, without
      generating any extra code.
      
      [Cast to "long" required because sizeof does not work for bit-fields.]
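
      A minimal userspace sketch of the idea, assuming the macro is built on
      sizeof() as the note about the cast suggests. The VM_BUG_ON() stand-in,
      struct flags and count_side_effects() are illustrative names only, and any
      kernel-specific annotations are omitted.

```c
/*
 * sizeof() does not evaluate its operand, so the expression is checked by
 * the compiler but produces no code and no side effects. The cast to long
 * is what allows bit-field expressions, which sizeof rejects on their own.
 */
#define BUILD_BUG_ON_INVALID(e) ((void)(sizeof((long)(e))))

/* Illustrative stand-in for a debug check that is compiled out. */
#define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond)

struct flags {
        unsigned int head : 1;   /* bit-field member */
};

/* Returns how many times a checked expression was actually evaluated. */
static int count_side_effects(void)
{
        int evaluated = 0;
        struct flags f = { .head = 1 };

        VM_BUG_ON(evaluated++ > 0);   /* operand of sizeof: never executed */
        VM_BUG_ON(!f.head);           /* still type-checked at compile time */

        return evaluated;
}
```

      Here count_side_effects() returns 0 because neither expression is ever
      evaluated, while a misspelled member name inside VM_BUG_ON() would still
      fail to compile.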
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      baf05aa9
    • C
      Cross Memory Attach: make it Kconfigurable · 5febcbe9
      Christopher Yeoh committed
      Add a Kconfig option so that people who don't want cross memory attach
      can leave it out of their build.
      Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5febcbe9
    • J
      Documentation: memcg: future proof hierarchical statistics documentation · eb6332a5
      Johannes Weiner committed
      The hierarchical versions of per-memcg counters in memory.stat are all
      calculated the same way and are all named total_<counter>.
      
      Documenting the pattern is easier for maintenance than listing each
      counter twice.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Ying Han <yinghan@google.com>
      Randy Dunlap <rdunlap@xenotime.net>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb6332a5
    • D
      mm, thp: drop page_table_lock to uncharge memcg pages · 6f60b69d
      David Rientjes committed
      mm->page_table_lock is hotly contested for page fault tests and isn't
      necessary to do mem_cgroup_uncharge_page() in do_huge_pmd_wp_page().
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f60b69d
    • Y
      mm: rename is_mlocked_vma() to mlocked_vma_newpage() · 096a7cf4
      Ying Han committed
      Andrew pointed out that is_mlocked_vma() is misnamed.  A function
      with a name like that would be expected to return bool and have no
      side-effects.

      Since it is called on the fault path for a new page, rename it in this
      patch.
      Signed-off-by: Ying Han <yinghan@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      [akpm@linux-foundation.org: s/mlock_vma_newpage/mlocked_vma_newpage/, per Minchan]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      096a7cf4
    • J
      mm: memcg: count pte references from every member of the reclaimed hierarchy · c3ac9a8a
      Johannes Weiner committed
      The rmap walker checking page table references has historically ignored
      references from VMAs that were not part of the memcg that was being
      reclaimed during memcg hard limit reclaim.
      
      When transitioning global reclaim to memcg hierarchy reclaim, I missed
      that bit and now references from outside a memcg are ignored even during
      global reclaim.
      
      Reverting to the traditional behaviour (count all references during
      global reclaim, and only mind references of the memcg being reclaimed
      during limit reclaim) would be one option.

      However, the more generic idea is to ignore references exactly when
      they are outside the hierarchy that is currently under reclaim, because
      only then will their reclamation be of any use to help the pressure
      situation.  It makes no sense to ignore references from a sibling memcg
      and then evict a page that will be immediately refaulted by that sibling
      which contributes to the same usage of the common ancestor under
      reclaim.
      
      The solution: make the rmap walker ignore references from VMAs that are
      not part of the hierarchy that is being reclaimed.
      
      Flat limit reclaim will stay the same, hierarchical limit reclaim will
      mind the references only to pages that the hierarchy owns.  Global
      reclaim, since it reclaims from all memcgs, will be fixed to regard all
      references.
      
      [akpm@linux-foundation.org: name the args in the declaration]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3ac9a8a
    • J
      kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite · 91c63734
      Johannes Weiner committed
      Library functions should not grab locks when the callsites can do it,
      even if the lock nests like the rcu read-side lock does.
      
      Push the rcu_read_lock() from css_is_ancestor() to its single user,
      mem_cgroup_same_or_subtree() in preparation for another user that may
      already hold the rcu read-side lock.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91c63734
    • A
      mm: do_migrate_pages(): rename arguments · 0ce72d4f
      Andrew Morton committed
      s/from_nodes/from/ and s/to_nodes/to/.  The "_nodes" is redundant - it
      duplicates the argument's type.
      
      Done in a fit of irritation over 80-col issues :(
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <mkosaki@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ce72d4f
    • L
      mm: do_migrate_pages() calls migrate_to_node() even if task is already on a correct node · 4a5b18cc
      Larry Woodman committed
      While running an application that moves tasks from one cpuset to another
      I noticed that it takes much longer and moves many more pages than
      expected.
      
      The reason for this is do_migrate_pages() does its best to preserve the
      relative node differential from the first node of the cpuset because the
      application may have been written with that in mind.  If memory was
      interleaved on the nodes of the source cpuset by an application,
      do_migrate_pages() will try its best to maintain that interleaving on
      the nodes of the destination cpuset.  This means copying the memory from
      all source nodes to the destination nodes even if the source and
      destination nodes overlap.
      
      This is a problem for userspace NUMA placement tools.  The amount of
      time spent doing extra memory moves cancels out some of the NUMA
      performance improvements.  Furthermore, if the number of source and
      destination nodes differs, it is impossible to maintain the previous
      interleaving layout anyway.
      
      This patch changes do_migrate_pages() to only preserve the relative
      layout inside the program if the number of NUMA nodes in the source and
      destination mask are the same.  If the number is different, we do a much
      more efficient migration by not touching memory that is in an allowed
      node.
      
      This preserves the old behaviour for programs that want it, while
      allowing a userspace NUMA placement tool to use the new, faster
      migration.  This improves performance in our tests by up to a factor of
      7.
      
      Without this change, when migrating tasks from a cpuset containing nodes
      0-7 to a cpuset containing nodes 3-4, we migrate from ALL the nodes even
      if they are in both the source and destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 4 to 3
         Migrating 1 to 4
         Migrating 3 to 4
         Migrating 0 to 3
         Migrating 2 to 3
      
      With this change we only migrate from nodes that are not in the
      destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 2 to 3
         Migrating 1 to 4
         Migrating 0 to 3
      
      Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset
      containing 3,4,5 we still do move everything so that we preserve the
      desired NUMA offsets:
      
         Migrating 4 to 5
         Migrating 3 to 4
         Migrating 2 to 3
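
      The three listings above can be reproduced with a small userspace sketch
      of the selection rule. It assumes node sets fit in a 32-bit mask and that
      the relative remap sends the k-th source node to the (k mod N)-th
      destination node, which matches the listings. All names here are
      hypothetical, and the ordering of the individual moves is not modelled.

```c
#include <assert.h>

#define MAX_NODES 32

/* Number of nodes in a mask. */
static int weight(unsigned int mask)
{
        int n = 0;

        for (; mask; mask &= mask - 1)   /* clear the lowest set bit */
                n++;
        return n;
}

/* Position of node within mask: how many member nodes are below it. */
static int ordinal(unsigned int mask, int node)
{
        return weight(mask & ((1u << node) - 1));
}

/* Node number of the n-th (0-based) member of mask, or -1 if none. */
static int nth_node(unsigned int mask, int n)
{
        for (int node = 0; node < MAX_NODES; node++)
                if ((mask & (1u << node)) && n-- == 0)
                        return node;
        return -1;
}

/* Destination for pages on source node s, or -1 if they are left alone. */
static int plan_dest(unsigned int from, unsigned int to, int s)
{
        int d;

        assert(to != 0 && (from & (1u << s)));

        /* New rule: with differing node counts, skip nodes that are allowed. */
        if (weight(from) != weight(to) && (to & (1u << s)))
                return -1;

        /* Relative remap preserves the offset within the node set. */
        d = nth_node(to, ordinal(from, s) % weight(to));
        return d == s ? -1 : d;
}
```

      With from = nodes 0-7 and to = nodes 3-4, nodes 3 and 4 are skipped and
      node 7 maps to node 4. With from = 2,3,4 and to = 3,4,5 the counts match,
      so every node moves.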
      
      As far as performance is concerned this simple patch improves the time
      it takes to move 14, 20 and 26 large tasks from a cpuset containing
      nodes 0-7 to a cpuset containing nodes 1 & 3 by up to a factor of 7.
      Here are the timings with and without the patch:
      
      BEFORE PATCH -- Move times: 59, 140, 651 seconds
      ============
      
        Moving 14 tasks from nodes (0-7) to nodes (1,3)
        numad(8780) do_migrate_pages (mm=0xffff88081d414400
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 8890 moved to node(s) 1,3 in 59.2 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,4-5)
        numad(8780) do_migrate_pages (mm=0xffff88081d88c700
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 8962 moved to node(s) 1,4-5 in 139.88 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1-3,5)
        numad(8780) do_migrate_pages (mm=0xffff88081d5bc740
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 9058 moved to node(s) 1-3,5 in 651.45 seconds
      
      AFTER PATCH -- Move times: 42, 56, 93 seconds
      ===========
      
        Moving 14 tasks from nodes (0-7) to nodes (5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d5ff140
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 33221 moved to node(s) 5,7 in 41.67 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,3,5)
        numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 33289 moved to node(s) 1,3,5 in 56.3 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d924400
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds
      
      [akpm@linux-foundation.org: clean up comment layout]
      Signed-off-by: Larry Woodman <lwoodman@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a5b18cc
    • D
      thp, memcg: split hugepage for memcg oom on cow · 1f1d06c3
      David Rientjes committed
      On COW, a new hugepage is allocated and charged to the memcg.  If the
      system is oom or the charge to the memcg fails, however, the fault
      handler will return VM_FAULT_OOM which results in an oom kill.
      
      Instead, it's possible to fall back to splitting the hugepage so that the
      COW results only in an order-0 page being allocated and charged to the
      memcg, which has a higher likelihood of succeeding.  This is expensive
      because the hugepage must be split in the page fault handler, but it is
      much better than unnecessarily oom killing a process.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f1d06c3
    • S
      mm/vmstat.c: remove debug fs entries on failure of file creation and made... · bde8bd8a
      Sasikantha babu committed
      mm/vmstat.c: remove debug fs entries on failure of file creation and made extfrag_debug_root dentry local
      
      Remove debug fs files and directory on failure.  Since no one is using
      "extfrag_debug_root" dentry outside of extfrag_debug_init(), make it
      local to the function.
      Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bde8bd8a
    • S
      mm/fork: fix overflow in vma length when copying mmap on clone · 7edc8b0a
      Siddhesh Poyarekar committed
      The vma length in dup_mmap is calculated and stored in an unsigned int,
      which is insufficient and hence overflows for very large maps (beyond
      16TB). The following program demonstrates this:
      
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <sys/wait.h>

      #define GIG (1024 * 1024 * 1024L)
      #define EXTENT 16393

      int main(void)
      {
              int i;
              pid_t r;
              void *m;

              /* Map 1GB at a time and fork after each new mapping. */
              for (i = 0; i < EXTENT; i++) {
                      m = mmap(NULL, (size_t) GIG, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                      if (m == MAP_FAILED)
                              perror("MMAP Failed");
                      else
                              printf("%d : MMAP returned %p\n", i, m);

                      r = fork();

                      if (r == 0) {
                              /* Child: the fork (and dup_mmap) worked. */
                              printf("%d: succeeded\n", i);
                              return 0;
                      } else if (r < 0)
                              perror("FORK Failed");
                      else
                              wait(NULL);
              }
              return 0;
      }
      
      Increase the storage size of the result to unsigned long, which is
      sufficient for storing the difference between addresses.
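
      The 16TB figure can be checked with a short sketch, assuming the stored
      length is a page count with 4KB pages: 2^32 pages of 4KB each is exactly
      16TB, so any larger map wraps around in a 32-bit counter. The function
      names below are illustrative and are not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4KB pages */

/* Page count as a 32-bit value, which silently wraps past 16TB. */
static uint32_t vma_pages_u32(uint64_t start, uint64_t end)
{
        assert(end >= start);
        return (uint32_t)((end - start) >> PAGE_SHIFT);
}

/* Page count in a type wide enough for any 64-bit address difference. */
static uint64_t vma_pages_u64(uint64_t start, uint64_t end)
{
        assert(end >= start);
        return (end - start) >> PAGE_SHIFT;
}
```

      For the 16393GB extent used by the test program, the 64-bit count is
      16393 * 262144 pages, while the 32-bit count wraps to 9 * 262144 pages,
      as if the map were only 9GB long.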
      Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7edc8b0a
    • R
      mm/mmap.c: find_vma(): remove unnecessary if(mm) check · 841e31e5
      Rajman Mekaco committed
      The "if (mm)" check is not required in find_vma, as the kernel code
      calls find_vma only when it is absolutely sure that the mm_struct arg to
      it is non-NULL.
      
      Remove the if (mm) check and add a WARN_ONCE(!mm) for now.  This
      will serve the purpose of mandating that the execution context
      (user-mode/kernel-mode) be known before find_vma is called.  Also
      fix 2 checkpatch.pl errors in the declaration of the rb_node and
      vma_tmp local variables.
      
      I was browsing through the internet and read a discussion at
      https://lkml.org/lkml/2012/3/27/342 which discusses removal of the
      validation check within find_vma.  Since no-one responded, I decided to
      send this patch with Andrew's suggestions.
      
      [akpm@linux-foundation.org: add remove-me comment]
      Signed-off-by: Rajman Mekaco <rajman.mekaco@gmail.com>
      Cc: Kautuk Consul <consul.kautuk@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      841e31e5
    • T
      mm: use kcalloc() instead of kzalloc() to allocate array · 4d67d860
      Thomas Meyer committed
      The advantage of kcalloc() is that it prevents integer overflows which
      could result from the multiplication of number of elements and size, and
      it is also a bit nicer to read.
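
      As a rough userspace illustration of the overflow check an array
      allocator can perform, consider the sketch below. The function
      checked_zalloc_array() is a hypothetical analogue and is not the kernel
      implementation of kcalloc().

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of an array allocator: refuse the request
 * when n * size would wrap around, instead of allocating too little memory.
 */
static void *checked_zalloc_array(size_t n, size_t size)
{
        if (size != 0 && n > SIZE_MAX / size)
                return NULL;        /* the multiplication would overflow */

        return calloc(n, size);     /* zero-filled, like a zeroing allocator */
}
```

      A caller that computes n * size by hand and passes the product to a
      single-size allocator gets no such protection, which is the case the
      patch replaces.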
      
      The semantic patch that makes this change is available in
      https://lkml.org/lkml/2011/11/25/107
      Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d67d860
    • R
      mm: fix off-by-one bug in print_nodes_state() · f6238818
      Ryota Ozaki committed
      /sys/devices/system/node/{online,possible} outputs a garbage byte
      because print_nodes_state() returns content size + 1.  To fix the bug,
      the patch changes the use of cpuset_sprintf_cpulist to follow the use at
      other places, which is clearer and safer.
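
      The class of bug can be illustrated in userspace: a show-style routine
      should return the number of content bytes it wrote, not that number plus
      one for the terminator. The sketch below uses a hypothetical show_list()
      helper and plain snprintf(); it is not the kernel code touched by the
      patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical analogue of a sysfs show() routine: write "list\n" into buf
 * and return the number of content bytes, excluding the terminating NUL.
 */
static int show_list(char *buf, size_t size, const char *list)
{
        int n;

        if (size < 2)
                return 0;                /* no room for "\n" plus NUL */

        /* Leave one byte free so the newline always fits. */
        n = snprintf(buf, size - 1, "%s", list);
        if (n < 0)
                n = 0;
        if ((size_t)n > size - 2)
                n = (int)(size - 2);     /* the list was truncated */

        buf[n++] = '\n';
        buf[n] = '\0';
        return n;                        /* equals strlen(buf), not one more */
}
```

      Returning n + 1 here would tell the reader that the NUL terminator is
      part of the content, which is how a stray byte ends up in the output.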
      
      This bug was introduced in v2.6.24 (commit bde631a5: "mm: add node
      states sysfs class attributeS").
      Signed-off-by: Ryota Ozaki <ozaki.ryota@gmail.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6238818
    • M
      mm: vmscan: remove reclaim_mode_t · 23b9da55
      Mel Gorman committed
      There is little motivation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC
      and lumpy reclaim have been removed.  This patch gets rid of
      reclaim_mode_t as well and improves the documentation about what
      reclaim/compaction is and when it is triggered.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23b9da55
    • M
      mm: vmscan: do not stall on writeback during memory compaction · 41ac1999
      Mel Gorman committed
      This patch stops reclaim/compaction entering sync reclaim, as this was
      only intended for lumpy reclaim and was an oversight.  Page migration has
      its own logic for stalling on writeback pages if necessary and memory
      compaction is already using it.
      
      Waiting on page writeback is bad for a number of reasons but the primary
      one is that waiting on writeback to a slow device like USB can take a
      considerable length of time.  Page reclaim instead uses
      wait_iff_congested() to throttle if too many dirty pages are being
      scanned.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41ac1999
    • M
      mm: vmscan: remove lumpy reclaim · c53919ad
      Mel Gorman committed
      This series removes lumpy reclaim and some stalling logic that was
      unintentionally being used by memory compaction.  The end result is that
      stalling on dirty pages during page reclaim now depends on
      wait_iff_congested().
      
      Four kernels were compared:
      
        3.3.0     vanilla
        3.4.0-rc2 vanilla
        3.4.0-rc2 lumpyremove-v2 is patch one from this series
        3.4.0-rc2 nosync-v2r3 is the full series
      
      Removing lumpy reclaim saves almost 900 bytes of text whereas the full
      series removes 1200 bytes.
      
           text     data      bss       dec     hex  filename
        6740375  1927944  2260992  10929311  a6c49f  vmlinux-3.4.0-rc2-vanilla
        6739479  1927944  2260992  10928415  a6c11f  vmlinux-3.4.0-rc2-lumpyremove-v2
        6739159  1927944  2260992  10928095  a6bfdf  vmlinux-3.4.0-rc2-nosync-v2
      
      There are behaviour changes in the series and so tests were run with
      monitoring of ftrace events.  This disrupts results so the performance
      results are distorted but the new behaviour should be clearer.
      
      fs-mark running in a threaded configuration showed little of interest as
      it did not push reclaim aggressively
      
        FS-Mark Multi Threaded
                        3.3.0-vanilla           rc2-vanilla             lumpyremove-v2r3        nosync-v2r3
        Files/s  min          3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)
        Files/s  mean         3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)
        Files/s  stddev       0.00 ( 0.00%)           0.00 ( 0.00%)           0.00 ( 0.00%)           0.00 ( 0.00%)
        Files/s  max          3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)
        Overhead min     508667.00 ( 0.00%)      521350.00 (-2.49%)      544292.00 (-7.00%)      547168.00 (-7.57%)
        Overhead mean    551185.00 ( 0.00%)      652690.73 (-18.42%)     991208.40 (-79.83%)     570130.53 (-3.44%)
        Overhead stddev   18200.69 ( 0.00%)      331958.29 (-1723.88%)  1579579.43 (-8578.68%)     9576.81 (47.38%)
        Overhead max     576775.00 ( 0.00%)     1846634.00 (-220.17%)   6901055.00 (-1096.49%)   585675.00 (-1.54%)
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             309.90    300.95    307.33    298.95
        User+Sys Time Running Test (seconds)        319.32    309.67    315.69    307.51
        Total Elapsed Time (seconds)               1187.85   1193.09   1191.98   1193.73
      
        MMTests Statistics: vmstat
        Page Ins                                       80532       82212       81420       79480
        Page Outs                                  111434984   111456240   111437376   111582628
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                           44881       27889       27453       34843
        Kswapd pages scanned                        25841428    25860774    25861233    25843212
        Kswapd pages reclaimed                      25841393    25860741    25861199    25843179
        Direct pages reclaimed                         44881       27889       27453       34843
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                            21754.791   21675.460   21696.029   21649.127
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                               37.783      23.375      23.031      29.188
        Percentage direct scans                           0%          0%          0%          0%
      
      ftrace showed that there was no stalling on writeback or pages submitted
      for IO from reclaim context.
      
      postmark was similar and while it was more interesting, it also did not
      push reclaim heavily.
      
        POSTMARK
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Transactions per second:               16.00 ( 0.00%)    20.00 (25.00%)    18.00 (12.50%)    17.00 ( 6.25%)
        Data megabytes read per second:        18.80 ( 0.00%)    24.27 (29.10%)    22.26 (18.40%)    20.54 ( 9.26%)
        Data megabytes written per second:     35.83 ( 0.00%)    46.25 (29.08%)    42.42 (18.39%)    39.14 ( 9.24%)
        Files created alone per second:        28.00 ( 0.00%)    38.00 (35.71%)    34.00 (21.43%)    30.00 ( 7.14%)
        Files create/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
        Files deleted alone per second:       556.00 ( 0.00%)  1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
        Files delete/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             113.34    107.99    109.73    108.72
        User+Sys Time Running Test (seconds)        145.51    139.81    143.32    143.55
        Total Elapsed Time (seconds)               1159.16    899.23    980.17   1062.27
      
        MMTests Statistics: vmstat
        Page Ins                                    13710192    13729032    13727944    13760136
        Page Outs                                   43071140    42987228    42733684    42931624
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                               0           0           0           0
        Kswapd pages scanned                         9941613     9937443     9939085     9929154
        Kswapd pages reclaimed                       9940926     9936751     9938397     9928465
        Direct pages reclaimed                             0           0           0           0
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                             8576.567   11051.058   10140.164    9347.109
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                                0.000       0.000       0.000       0.000
      
      It looks here as though the full series regresses performance but as
      ftrace showed no usage of wait_iff_congested() or sync reclaim I am
      assuming it's a disruption due to monitoring.  Other data such as memory
      usage, page IO, swap IO all looked similar.
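
      For reference, the bracketed percentages in the POSTMARK table are
      relative changes against the 3.3.0-vanilla column.  A minimal sketch,
      assuming a plain (value - baseline) / baseline calculation, which
      reproduces the printed figures; pct_change is a name made up for
      illustration.  The last line shows the apparent regression of the full
      series against rc2-vanilla rather than against 3.3.0.

```python
def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# "Files deleted alone per second" row from the POSTMARK table.
deleted_alone = {
    "3.3.0-vanilla": 556.0,
    "rc2-vanilla": 1224.0,
    "lumpyremove-v2r3": 3062.0,
    "nosync-v2r3": 6124.0,
}

base = deleted_alone["3.3.0-vanilla"]
for kernel, value in deleted_alone.items():
    print(f"{kernel:18s} {value:8.2f} ({pct_change(base, value):7.2f}%)")

# Transactions per second: full series (17) compared with rc2-vanilla (20).
tps_vs_rc2 = pct_change(20.0, 17.0)
print(f"nosync-v2r3 vs rc2-vanilla transactions/s: {tps_vs_rc2:.2f}%")
```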
      
      Running a benchmark with a plain DD showed nothing very interesting.
      The full series stalled in wait_iff_congested() slightly less but stall
      times on vanilla kernels were marginal.
      
      Running a benchmark that hammered on file-backed mappings showed stalls
      due to congestion but not in sync writebacks.
      
        MICRO
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             308.13    294.50    298.75    299.53
        User+Sys Time Running Test (seconds)        330.45    316.28    318.93    320.79
        Total Elapsed Time (seconds)               1814.90   1833.88   1821.14   1832.91
      
        MMTests Statistics: vmstat
        Page Ins                                      108712      120708       97224      110344
        Page Outs                                  155514576   156017404   155813676   156193256
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                         2599253     1550480     2512822     2414760
        Kswapd pages scanned                        69742364    71150694    68839041    69692533
        Kswapd pages reclaimed                      34824488    34773341    34796602    34799396
        Direct pages reclaimed                         53693       94750       61792       75205
        Kswapd efficiency                                49%         48%         50%         49%
        Kswapd velocity                            38427.662   38797.901   37799.972   38022.889
        Direct efficiency                                 2%          6%          2%          3%
        Direct velocity                             1432.174     845.464    1379.807    1317.446
        Percentage direct scans                           3%          2%          3%          3%
        Page writes by reclaim                             0           0           0           0
        Page writes file                                   0           0           0           0
        Page writes anon                                   0           0           0           0
        Page reclaim immediate                             0           0           0        1218
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                  15360       16384       13312       16384
        Direct inode steals                                0           0           0           0
        Kswapd inode steals                             4340        4327        1630        4323
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 0          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               900        870        754        789
        Direct time   conditional waited               0ms        0ms        0ms       20ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited              2106       2308       2116       1915
        KSwapd time   congest     waited          139924ms   157832ms   125652ms   132516ms
        KSwapd full   congest     waited              1346       1530       1202       1278
        KSwapd number conditional waited             12922      16320      10943      14670
        KSwapd time   conditional waited               0ms        0ms        0ms        0ms
        KSwapd full   conditional waited                 0          0          0          0
      
      Reclaim statistics are not radically changed.  The stall times in kswapd
      are massive but it is clear that it is due to calls to congestion_wait()
      and that is almost certainly the call in balance_pgdat().  Otherwise
      stalls due to dirty pages are non-existent.
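
      To put the kswapd congestion_wait() numbers in proportion, the sketch
      below computes the mean stall per call and the share of stall time
      accounted for by full-length waits.  It is an illustration only: the
      100ms length of a full wait is an assumption, chosen because it is
      consistent with the figures in this message (for example 4 full waits
      and 400ms in the high-order allocation test), and summarise is a
      made-up helper name.

```python
# Assumption: one full-length congestion_wait() call lasts 100ms.
FULL_WAIT_MS = 100

# kernel -> (number of waits, total wait time in ms, full-length waits),
# copied from the "KSwapd ... congest waited" rows of the MICRO ftrace table.
KSWAPD_CONGEST = {
    "3.3.0-vanilla": (2106, 139924, 1346),
    "rc2-vanilla": (2308, 157832, 1530),
    "lumpyremove-v2r3": (2116, 125652, 1202),
    "nosync-v2r3": (1915, 132516, 1278),
}


def summarise(number, time_ms, full):
    """Return (mean ms per wait, fraction of time spent in full waits)."""
    mean_ms = time_ms / number
    full_share = full * FULL_WAIT_MS / time_ms
    return mean_ms, full_share


summary = {k: summarise(*v) for k, v in KSWAPD_CONGEST.items()}
for kernel, (mean_ms, full_share) in summary.items():
    print(f"{kernel:18s} mean {mean_ms:5.1f}ms/wait, "
          f"{full_share:.1%} of stall time in full waits")
```

      Under that assumption, nearly all of the kswapd stall time in every
      kernel comes from waits that ran to the full timeout.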
      
      I ran a benchmark that stressed high-order allocation.  This is a very
      artificial load but was used in the past to evaluate lumpy reclaim and
      compaction.  Generally I look at allocation success rates and latency
      figures.
      
        STRESS-HIGHALLOC
                         3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Pass 1          81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
        Pass 2          82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
        while Rested    88.00 ( 0.00%)    87.00 (-1.00%)    88.00 ( 0.00%)    88.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             740.93    681.42    685.14    684.87
        User+Sys Time Running Test (seconds)       2922.65   3269.52   3281.35   3279.44
        Total Elapsed Time (seconds)               1161.73   1152.49   1159.55   1161.44
      
        MMTests Statistics: vmstat
        Page Ins                                     4486020     2807256     2855944     2876244
        Page Outs                                    7261600     7973688     7975320     7986120
        Swap Ins                                       31694           0           0           0
        Swap Outs                                      98179           0           0           0
        Direct pages scanned                           53494       57731       34406      113015
        Kswapd pages scanned                         6271173     1287481     1278174     1219095
        Kswapd pages reclaimed                       2029240     1281025     1260708     1201583
        Direct pages reclaimed                          1468       14564       16649       92456
        Kswapd efficiency                                32%         99%         98%         98%
        Kswapd velocity                             5398.133    1117.130    1102.302    1049.641
        Direct efficiency                                 2%         25%         48%         81%
        Direct velocity                               46.047      50.092      29.672      97.306
        Percentage direct scans                           0%          4%          2%          8%
        Page writes by reclaim                       1616049           0           0           0
        Page writes file                             1517870           0           0           0
        Page writes anon                               98179           0           0           0
        Page reclaim immediate                        103778       27339        9796       17831
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                1096704      986112      980992      998400
        Direct inode steals                              223      215040      216736      247881
        Kswapd inode steals                           175331       61548       68444       63066
        Kswapd skipped wait                            21991           0           1           0
        THP fault alloc                                    1         135         125         134
        THP collapse alloc                               393         311         228         236
        THP splits                                        25          13           7           8
        THP fault fallback                                 0           0           0           0
        THP collapse fail                                  3           5           7           7
        Compaction stalls                                865        1270        1422        1518
        Compaction success                               370         401         353         383
        Compaction failures                              495         869        1069        1135
        Compaction pages moved                        870155     3828868     4036106     4423626
        Compaction move failure                        26429       23865       29742       27514
      
      Success rates are completely hosed for 3.4-rc2 which is almost certainly
      due to commit fe2c2a10 ("vmscan: reclaim at order 0 when compaction
      is enabled").  I expected this would happen for kswapd and impair
      allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did
      not anticipate this much of a difference: 80% less scanning, 37% less
      reclaim by kswapd.
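
      The 80% and 37% figures, as well as the derived kswapd rows of the
      vmstat table, can be reproduced from the raw counters.  This is a
      minimal sketch under assumptions: the formulas (efficiency as reclaimed
      over scanned truncated to a whole percent, velocity as pages scanned
      per second of elapsed time) are inferred from the printed numbers, not
      taken from the MMTests source, and the helper names are made up.

```python
# kernel -> (kswapd pages scanned, kswapd pages reclaimed, elapsed seconds),
# copied from the STRESS-HIGHALLOC vmstat and duration tables.
KSWAPD = {
    "3.3.0-vanilla": (6271173, 2029240, 1161.73),
    "rc2-vanilla": (1287481, 1281025, 1152.49),
}


def efficiency(scanned, reclaimed):
    """Reclaimed pages as a whole-number percentage of pages scanned."""
    return reclaimed * 100 // scanned


def velocity(scanned, elapsed):
    """Pages scanned per second of elapsed test time."""
    return scanned / elapsed


def drop(before, after):
    """Percentage reduction going from before to after."""
    return (before - after) / before * 100.0


old_scan, old_reclaim, old_elapsed = KSWAPD["3.3.0-vanilla"]
new_scan, new_reclaim, new_elapsed = KSWAPD["rc2-vanilla"]

scan_drop = drop(old_scan, new_scan)
reclaim_drop = drop(old_reclaim, new_reclaim)

print(f"efficiency: {efficiency(old_scan, old_reclaim)}% -> "
      f"{efficiency(new_scan, new_reclaim)}%")
print(f"velocity:   {velocity(old_scan, old_elapsed):.3f} -> "
      f"{velocity(new_scan, new_elapsed):.3f}")
print(f"kswapd scanned {scan_drop:.1f}% fewer pages, "
      f"reclaimed {reclaim_drop:.1f}% fewer")
```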
      
      In comparison, reclaim/compaction is not aggressive and gives up easily
      which is the intended behaviour.  hugetlbfs uses __GFP_REPEAT and would
      be much more aggressive about reclaim/compaction than THP allocations
      are.  The stress test above is allocating like neither THP nor
      hugetlbfs but is much closer to THP.
      
      Mainline is now impaired in terms of high order allocation under heavy
      load although I do not know to what degree as I did not test with
      __GFP_REPEAT.  Keep this in mind for bugs related to hugepage pool
      resizing, THP allocation and high order atomic allocation failures from
      network devices.
      
      In terms of congestion throttling, I see the following for this test
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 3          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               957        512       1081       1075
        Direct time   conditional waited               0ms        0ms        0ms        0ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited                36          4          3          5
        KSwapd time   congest     waited            3148ms      400ms      300ms      500ms
        KSwapd full   congest     waited                30          4          3          5
        KSwapd number conditional waited             88514        197        332        542
        KSwapd time   conditional waited            4980ms        0ms        0ms        0ms
        KSwapd full   conditional waited                49          0          0          0
      
      The "conditional waited" times are the most interesting as this is
      directly impacted by the number of dirty pages encountered during scan.
      As lumpy reclaim is no longer scanning contiguous ranges, it is finding
      fewer dirty pages.  This brings wait times from about 5 seconds to 0.
      kswapd itself is still calling congestion_wait() so it'll still stall but
      it's a lot less.
      
      In terms of the type of IO we were doing, I see this
      
        FTrace Reclaim Statistics: mm_vmscan_writepage
        Direct writes anon  sync                         0          0          0          0
        Direct writes anon  async                        0          0          0          0
        Direct writes file  sync                         0          0          0          0
        Direct writes file  async                        0          0          0          0
        Direct writes mixed sync                         0          0          0          0
        Direct writes mixed async                        0          0          0          0
        KSwapd writes anon  sync                         0          0          0          0
        KSwapd writes anon  async                    91682          0          0          0
        KSwapd writes file  sync                         0          0          0          0
        KSwapd writes file  async                   822629          0          0          0
        KSwapd writes mixed sync                         0          0          0          0
        KSwapd writes mixed async                        0          0          0          0
      
      In 3.2, kswapd was doing a bunch of async writes of pages but
      reclaim/compaction was never reaching a point where it was doing sync
      IO.  This does not guarantee that reclaim/compaction was not calling
      wait_on_page_writeback() but I would consider it unlikely.  It indicates
      that merging patches 2 and 3 to stop reclaim/compaction calling
      wait_on_page_writeback() should be safe.
      
      This patch:
      
      Lumpy reclaim had a purpose but in the mind of some, it was to kick the
      system so hard it thrashed.  For others the purpose was to complicate
      vmscan.c.  Over time it was given softer shoes and a nicer attitude but
      memory compaction needs to step up and replace it so this patch sends
      lumpy reclaim to the farm.
      
      The tracepoint format changes for isolating LRU pages with this patch
      applied.  Furthermore reclaim/compaction can no longer queue dirty pages
      in pageout() if the underlying BDI is congested.  Lumpy reclaim used
      this logic and reclaim/compaction was using it in error.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c53919ad
    • R
      mm: remove swap token code · e709ffd6
      Rik van Riel committed
      The swap token code no longer fits in with the current VM model.  It
      does not play well with cgroups or the better NUMA placement code in
      development, since we have only one swap token globally.
      
      It also has the potential to mess with scalability of the system, by
      increasing the number of non-reclaimable pages on the active and
      inactive anon LRU lists.
      
      Last but not least, the swap token code has been broken for a year
      without complaints, as reported by Konstantin Khlebnikov.  This suggests
      we no longer have much use for it.
      
      The days of sub-1G memory systems with heavy use of swap are over.  If
      we ever need thrashing reducing code in the future, we will have to
      implement something that does scale.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Bob Picco <bpicco@meloft.net>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e709ffd6
    • D
      mm, thp: allow fallback when pte_alloc_one() fails for huge pmd · edad9d2c
      David Rientjes committed
      The transparent hugepages feature is careful to not invoke the oom
      killer when a hugepage cannot be allocated.
      
      pte_alloc_one() failing in __do_huge_pmd_anonymous_page(), however,
      currently results in VM_FAULT_OOM which invokes the pagefault oom killer
      to kill a memory-hogging task.
      
      This is unnecessary since it's possible to drop the reference to the
      hugepage and fallback to allocating a small page.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edad9d2c
    • D
      mm, thp: remove unnecessary ret variable · aa2e878e
      David Rientjes committed
      The "ret" variable is unnecessary in __do_huge_pmd_anonymous_page(), so
      remove it.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa2e878e
    • W
      mm/hugetlb.c: use long vars instead of int in region_count() · f2135a4a
      Wang Sheng-Hui committed
      The arguments f & t and fields from & to of struct file_region are
      defined as long.  So use long instead of int to type the temp vars.
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2135a4a
    • W
      mm/mempolicy.c: use enum value MPOL_REBIND_ONCE in mpol_rebind_policy() · 89c522c7
      Wang Sheng-Hui committed
      We have enum definition in mempolicy.h: MPOL_REBIND_ONCE.  It should
      replace the magic number 0 for step comparison in function
      mpol_rebind_policy.
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89c522c7
    • B
      mm/memory_failure: let the compiler add the function name · 71dd0b8a
      Borislav Petkov committed
      These things tend to get out of sync with time so let the compiler
      automatically enter the current function name using __func__.
      
      No functional change.
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      Acked-by: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71dd0b8a
    • S
      mm: fix NULL ptr deref when walking hugepages · 08fa29d9
      Sasha Levin committed
      A missing validation of the value returned by find_vma() could cause a
      NULL ptr dereference when walking the pagetable.
      
      This can be triggered from user mode by an unprivileged user trying to
      read page info out of /proc/pid/pagemap for a page which doesn't exist.
      
      Introduced by commit 025c5b24 ("thp: optimize away unnecessary page
      table locking").
      Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>		[3.4.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08fa29d9
    • C
      cris: select GENERIC_ATOMIC64 · 4c9c6a1b
      Cong Wang committed
      Cris doesn't implement atomic64 operations either, so it should select
      GENERIC_ATOMIC64.
      Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c9c6a1b
    • P
      pagemap.h: fix warning about possibly used before init var · af2e8409
      Paul Gortmaker committed
      Commit f56f821f ("mm: extend prefault helpers to fault in more than
      PAGE_SIZE") added the new functions fault_in_multipages_writeable()
      and fault_in_multipages_readable().
      
      However, we currently see:
      
        include/linux/pagemap.h:492: warning: 'ret' may be used uninitialized in this function
        include/linux/pagemap.h:492: note: 'ret' was declared here
      
      Unlike a lot of gcc nags, this one appears somewhat legit, i.e. passing
      in an invalid negative value of "size" does make it look like all the
      conditionals in there would be bypassed and the uninitialized value would
      be returned.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af2e8409
    • L
      Merge tag 'mfd-3.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6 · 4b781474
      Linus Torvalds committed
      Pull MFD changes from Samuel Ortiz:
       "Besides the usual cleanups, this one brings:
      
         * Support for 5 new chipsets: Intel's ICH LPC and SCH Centerton,
           ST-E's STAX211, Samsung's MAX77693 and TI's LM3533.
      
         * Device tree support for the twl6040, tps65910, da9052 and ab8500
           drivers.
      
         * Fairly big tps65910, ab8500 and db8500 updates.
      
         * i2c support for mc13xxx.
      
         * Our regular update for the wm8xxx driver from Mark."
      
      Fix up various conflicts with other trees, largely due to ab5500 removal
      etc.
      
      * tag 'mfd-3.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (106 commits)
        mfd: Fix build break of max77693 by adding REGMAP_I2C option
        mfd: Fix twl6040 build failure
        mfd: Fix max77693 build failure
        mfd: ab8500-core should depend on MFD_DB8500_PRCMU
        gpio: tps65910: dt: process gpio specific device node info
        mfd: Remove the parsing of dt info for tps65910 gpio
        mfd: Save device node parsed platform data for tps65910 sub devices
        mfd: Add r_select to lm3533 platform data
        gpio: Add Intel Centerton support to gpio-sch
        mfd: Emulate active low IRQs as well as active high IRQs for wm831x
        mfd: Mark two lm3533 zone registers as volatile
        mfd: Fix return type of lm533 attribute is_visible
        mfd: Enable Device Tree support in the ab8500-pwm driver
        mfd: Enable Device Tree support in the ab8500-sysctrl driver
        mfd: Add support for Device Tree to twl6040
        mfd: Register the twl6040 child for the ASoC codec unconditionally
        mfd: Allocate twl6040 IRQ numbers dynamically
        mfd: twl6040 code cleanup in interrupt initialization part
        mfd: Enable ab8500-gpadc driver for Device Tree
        mfd: Prevent unassigned pointer from being used in ab8500-gpadc driver
        ...
      4b781474
    • L
      Merge tag 'nfs-for-3.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 53f2c4a8
      Linus Torvalds committed
      Pull NFS client updates from Trond Myklebust:
       "New features include:
         - Rewrite the O_DIRECT code so that it can share the same coalescing
           and pNFS functionality as the page cache code.
         - Allow the server to provide hints as to when we should use pNFS,
           and when it is more efficient to read and write through the
           metadata server.
         - NFS cache consistency updates:
           * Use the ctime to emulate a change attribute for NFSv2/v3 so that
             all NFS versions can share the same cache management code.
           * New cache management code will only look at the change attribute
             and size attribute when deciding whether or not our cached data
             is still valid or not.
           * Don't request NFSv4 post-op attributes on writes in cases such as
             O_DIRECT, where we don't care about data cache consistency, or
             when we have a write delegation, and know that our cache is still
             consistent.
           * Don't request NFSv4 post-op attributes on operations such as
             COMMIT, where there are no expected metadata updates.
           * Don't request NFSv4 directory post-op attributes in cases where
             the operations themselves already return change attribute
             updates: i.e. operations such as OPEN, CREATE, REMOVE, LINK and
             RENAME.
         - Speed up 'ls' and friends by using READDIR rather than READDIRPLUS
           if we detect no attempts to lookup filenames.
         - Improve the code sharing between NFSv2/v3 and v4 mounts
         - NFSv4.1 state management efficiency improvements
         - More patches in preparation for NFSv4/v4.1 migration functionality."
      
      Fix trivial conflict in fs/nfs/nfs4proc.c that was due to the dcache
      qstr name initialization changes (that made the length/hash a 64-bit
      union)
      
      * tag 'nfs-for-3.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (146 commits)
        NFSv4: Add debugging printks to state manager
        NFSv4: Map NFS4ERR_SHARE_DENIED into an EACCES error instead of EIO
        NFSv4: update_changeattr does not need to set NFS_INO_REVAL_PAGECACHE
        NFSv4.1: nfs4_reset_session should use nfs4_handle_reclaim_lease_error
        NFSv4.1: Handle other occurrences of NFS4ERR_CONN_NOT_BOUND_TO_SESSION
        NFSv4.1: Handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION in the state manager
        NFSv4.1: Handle errors in nfs4_bind_conn_to_session
        NFSv4.1: nfs4_bind_conn_to_session should drain the session
        NFSv4.1: Don't clobber the seqid if exchange_id returns a confirmed clientid
        NFSv4.1: Add DESTROY_CLIENTID
        NFSv4.1: Ensure we use the correct credentials for bind_conn_to_session
        NFSv4.1: Ensure we use the correct credentials for session create/destroy
        NFSv4.1: Move NFSPROC4_CLNT_BIND_CONN_TO_SESSION to the end of the operations
        NFSv4.1: Handle NFS4ERR_SEQ_MISORDERED when confirming the lease
        NFSv4: When purging the lease, we must clear NFS4CLNT_LEASE_CONFIRM
        NFSv4: Clean up the error handling for nfs4_reclaim_lease
        NFSv4.1: Exchange ID must use GFP_NOFS allocation mode
        nfs41: Use BIND_CONN_TO_SESSION for CB_PATH_DOWN*
        nfs4.1: add BIND_CONN_TO_SESSION operation
        NFSv4.1 test the mdsthreshold hint parameters
        ...
      53f2c4a8
    • A
      tty: fix ldisc lock inversion trace · 8f6576ad
      Alan Cox committed
      This is caused by tty_release using tty_lock_pair to lock both sides of
      the pty/tty pair, and then tty_ldisc_release dropping and relocking one
      side only.  We can drop both fine, so drop both to avoid any lock
      ordering concerns.
      
      Rework the release path to fix the new locking model.
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f6576ad
    • A
      pty: Fix lock inversion · d3ca8b64
      Alan Cox committed
      The ptmx_open path takes the tty and devpts locks in the wrong order
      because tty_init_dev locks and returns a locked tty.  As far as I can
      tell this is actually safe anyway because the tty being returned is new
      so nobody can get a reference to lock it at this point.
      
      However we don't even need the devpts lock at this point, it's only held
      as a byproduct of the way the locks were pushed down.
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3ca8b64
  2. 29 May, 2012 5 commits