1. 16 Nov 2017: 2 commits
  2. 14 Oct 2017: 1 commit
  3. 09 Sep 2017: 4 commits
  4. 19 Aug 2017: 1 commit
    • mm/mempolicy: fix use after free when calling get_mempolicy · 73223e4e
      Committed by zhong jiang
      I hit a use after free issue when executing trinity and reproduced it
      with KASAN enabled.  The related call trace is as follows.
      
        BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
        Read of size 2 by task syz-executor1/798
      
        INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
           __slab_alloc+0x768/0x970
           kmem_cache_alloc+0x2e7/0x450
           mpol_new.part.2+0x74/0x160
           mpol_new+0x66/0x80
           SyS_mbind+0x267/0x9f0
           system_call_fastpath+0x16/0x1b
        INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
           __slab_free+0x495/0x8e0
           kmem_cache_free+0x2f3/0x4c0
           __mpol_put+0x2b/0x40
           SyS_mbind+0x383/0x9f0
           system_call_fastpath+0x16/0x1b
        INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
        INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600
      
        Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
        Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
        Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5                          kkkkkkk.
        Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb                          ........
        Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
        Memory state around the buggy address:
        ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc
      
      The shared memory policy is not protected against parallel removal by
      another thread, which is normally prevented by holding the mmap_sem.
      do_get_mempolicy, however, drops the lock midway while we can still
      access the policy later.
      
      The early premature up_read is a historical artifact from the times
      when put_user was called in this path (see
      https://lwn.net/Articles/124754/), but that has been gone since
      8bccd85f ("[PATCH] Implement sys_* do_* layering in the memory policy
      layer.").  The premature release remained, however, and under the
      current mempolicy ref count model it results in the use after free
      shown above.
      
      Fix the issue by removing the premature release.
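
      A minimal sketch of the race, heavily simplified from the actual
      do_get_mempolicy() (the lookup helpers below are illustrative, not the
      exact kernel code path):
      
          /* Buggy pattern: the shared policy is looked up under mmap_sem,
           * but the lock is released before the policy is last used. */
          static long do_get_mempolicy_sketch(int *mode, unsigned long addr)
          {
              struct mm_struct *mm = current->mm;
              struct vm_area_struct *vma;
              struct mempolicy *pol;
          
              down_read(&mm->mmap_sem);
              vma = find_vma_intersection(mm, addr, addr + 1);
              pol = vma->vm_ops->get_policy(vma, addr);  /* shared policy */
              up_read(&mm->mmap_sem);  /* BUG: premature release */
          
              /* Another thread may now replace the policy and drop the last
               * reference, so this read is a use after free.  The fix keeps
               * mmap_sem held until here. */
              *mode = pol->mode;
              return 0;
          }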
      
      Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[2.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 13 Jul 2017: 1 commit
  6. 07 Jul 2017: 4 commits
    • mm, mempolicy: don't check cpuset seqlock where it doesn't matter · e0dd7d53
      Committed by Vlastimil Babka
      Two wrappers of __alloc_pages_nodemask() are checking
      task->mems_allowed_seq themselves to retry allocation that has raced
      with a cpuset update.
      
      This has been shown to be ineffective in preventing premature OOMs,
      which can happen in __alloc_pages_slowpath() long before control
      returns to the wrappers where the race would be detected.
      
      Previous patches have made __alloc_pages_slowpath() more robust, so we
      can now simply remove the seqlock checking in the wrappers to avoid
      the wrong impression that it actually helps.
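
      Schematically, this is the retry pattern removed from the wrappers (a
      simplified sketch, not the exact wrapper code; the seqlock is
      task->mems_allowed_seq, read through the read_mems_allowed_begin()/
      read_mems_allowed_retry() helpers):
      
          struct page *alloc_pages_wrapper_sketch(gfp_t gfp, unsigned int order)
          {
              struct page *page;
              unsigned int cpuset_mems_cookie;
          
          retry_cpuset:
              cpuset_mems_cookie = read_mems_allowed_begin();
              page = __alloc_pages_nodemask(gfp, order, numa_node_id(), NULL);
              /* Retry on a raced cpuset update -- but by this point
               * __alloc_pages_slowpath() may already have declared OOM,
               * which is why this check is ineffective. */
              if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
                  goto retry_cpuset;
              return page;
          }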
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-7-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, mempolicy: simplify rebinding mempolicies when updating cpusets · 213980c0
      Committed by Vlastimil Babka
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") has introduced a two-step protocol when
      rebinding task's mempolicy due to cpuset update, in order to avoid a
      parallel allocation seeing an empty effective nodemask and failing.
      
      Later, commit cc9a6c87 ("cpuset: mm: reduce large amounts of memory
      barrier related damage v3") introduced a seqlock protection and removed
      the synchronization point between the two update steps.  At that point
      (or perhaps later), the two-step rebinding became unnecessary.
      
      Currently the two steps only ensure that the update first adds new
      nodes (step 1) and then removes old nodes (step 2).  Without memory
      barriers the effects are questionable, and even then this cannot
      prevent a parallel zonelist iteration that checks the nodemask at each
      step from observing all nodes as unusable for allocation.  We now
      fully rely on the seqlock to prevent premature OOMs and allocation
      failures.
      
      We can thus remove the two-step update parts and simplify the code.
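
      The resulting single-step rebind looks roughly like this (a simplified
      sketch of the post-patch mpol_rebind_nodemask(), with the
      MPOL_F_RELATIVE_NODES branch and the empty-nodemask fallback elided):
      
          static void mpol_rebind_nodemask_sketch(struct mempolicy *pol,
                                                  const nodemask_t *nodes)
          {
              nodemask_t tmp;
          
              if (pol->flags & MPOL_F_STATIC_NODES)
                  nodes_and(tmp, pol->w.user_nodemask, *nodes);
              else {
                  /* One step: remap the old nodes onto the new mems;
                   * no intermediate "union" state is needed anymore. */
                  nodes_remap(tmp, pol->v.nodes,
                              pol->w.cpuset_mems_allowed, *nodes);
                  pol->w.cpuset_mems_allowed = *nodes;
              }
              pol->v.nodes = tmp;
          }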
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: pass preferred nid instead of zonelist to allocator · 04ec6264
      Committed by Vlastimil Babka
      The main allocator function __alloc_pages_nodemask() takes a zonelist
      pointer as one of its parameters.  All of its callers directly or
      indirectly obtain the zonelist via node_zonelist() using a preferred
      node id and gfp_mask.  We can make the code a bit simpler by doing the
      zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
      id instead (gfp_mask is already another parameter).
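
      Schematic before/after of the interface change (a sketch;
      __alloc_pages_internal is a hypothetical stand-in for the rest of the
      allocation path, and the alloc_context detail is simplified):
      
          /*
           * Before: struct page *
           * __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
           *                        struct zonelist *zonelist,
           *                        nodemask_t *nodemask);
           */
          struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                              int preferred_nid,
                                              nodemask_t *nodemask)
          {
              struct alloc_context ac = { };
          
              /* the lookup that every caller used to inline: */
              ac.zonelist = node_zonelist(preferred_nid, gfp_mask);
          
              /* hypothetical stand-in for the rest of the path */
              return __alloc_pages_internal(gfp_mask, order, &ac, nodemask);
          }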
      
      There are some code size benefits thanks to removal of inlined
      node_zonelist():
      
        bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)
      
      This will also make things simpler if we proceed with converting cpusets
      to zonelists.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask() · 45816682
      Committed by Vlastimil Babka
      The task->il_next variable stores the next allocation node id for task's
      MPOL_INTERLEAVE policy.  mpol_rebind_nodemask() updates interleave and
      bind mempolicies due to changing cpuset mems.  Currently it also tries
      to make sure that current->il_next is valid within the updated nodemask.
      This is bogus, because 1) we are potentially updating any task's
      mempolicy, not just current's, and 2) we might be updating a per-vma
      mempolicy, not a task mempolicy.
      
      The interleave_nodes() function that uses il_next can cope fine with the
      value not being within the currently allowed nodes, so this hasn't
      manifested as an actual issue.
      
      We can remove the need for updating il_next completely by changing it to
      il_prev and store the node id of the previous interleave allocation
      instead of the next id.  Then interleave_nodes() can calculate the next
      id using the current nodemask and also store it as il_prev, except when
      querying the next node via do_get_mempolicy().
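
      The reworked interleave_nodes() then looks roughly like this (a sketch
      close to the post-patch code; next_node_in() wraps around within the
      nodemask):
      
          static unsigned interleave_nodes(struct mempolicy *policy)
          {
              unsigned next;
              struct task_struct *me = current;
          
              /* compute the next node from the *current* nodemask, so a
               * stale il_prev is harmless after a cpuset rebind */
              next = next_node_in(me->il_prev, policy->v.nodes);
              if (next < MAX_NUMNODES)
                  me->il_prev = next;
              return next;
          }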
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 09 Apr 2017: 1 commit
  8. 02 Mar 2017: 3 commits
  9. 25 Jan 2017: 1 commit
  10. 25 Dec 2016: 1 commit
  11. 13 Dec 2016: 3 commits
  12. 19 Oct 2016: 1 commit
  13. 08 Oct 2016: 1 commit
  14. 02 Sep 2016: 1 commit
  15. 29 Jul 2016: 1 commit
    • mm, vmscan: move LRU lists to node · 599d0c95
      Committed by Mel Gorman
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to the reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages at both the zone and
      the node level.  Most reclaim logic is based on the node counters, but
      the retry logic uses the zone counters, which do not distinguish
      inactive and active sizes.  It would be possible to keep the LRU
      counters on a per-zone basis, but that is a heavier calculation across
      multiple cache lines, and it runs much more frequently than the retry
      checks.
      
      Other than the LRU counters, this is mostly a mechanical patch, but
      note that it introduces a number of anomalies.  For example, the scans
      are per-zone but use per-node counters.  We also mark a node as
      congested when a zone is congested.  This causes weird problems that
      are fixed later, but doing it this way is easier to review.
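
      A simplified sketch of the structural change (field names approximate
      the series, not necessarily this exact commit):
      
          /* The LRU lists move from struct zone into the node. */
          struct lruvec {
              struct list_head lists[NR_LRU_LISTS];
              /* ... reclaim statistics ... */
          };
          
          typedef struct pglist_data {
              /* ... zones, per-node counters ... */
              struct lruvec lruvec;   /* previously one lruvec per zone */
          } pg_data_t;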
      
      In the event that there is excessive overhead on 32-bit systems due to
      the LRU lists being per-node, there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate; the potential corner case is that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 27 Jul 2016: 2 commits
  17. 20 May 2016: 3 commits
  18. 18 Mar 2016: 1 commit
  19. 16 Mar 2016: 1 commit
  20. 10 Mar 2016: 1 commit
  21. 16 Feb 2016: 1 commit
    • mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm · d4edcf0d
      Committed by Dave Hansen
      We will soon modify the vanilla get_user_pages() so it can no
      longer be used on mm/tasks other than 'current/current->mm',
      which is by far the most common way it is called.  For now,
      we allow the old-style calls, but warn when they are used
      (implemented in the previous patch).
      
      This patch switches all callers of:
      
      	get_user_pages()
      	get_user_pages_unlocked()
      	get_user_pages_locked()
      
      to stop passing tsk/mm so they will no longer see the warnings.
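
      Schematically, the call-site conversion looks like this (a sketch;
      the surrounding variables are illustrative):

          /* old style: task and mm passed explicitly, now warns */
          ret = get_user_pages(current, current->mm, start, nr_pages,
                               write, force, pages, NULL);

          /* new style: current/current->mm are implicit */
          ret = get_user_pages(start, nr_pages, write, force, pages, NULL);
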
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: jack@suse.cz
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  22. 06 Feb 2016: 1 commit
    • mempolicy: do not try to queue pages from !vma_migratable() · 77bf45e7
      Committed by Kirill A. Shutemov
      Maybe I'm missing some point, but I don't see a reason why we try to
      queue pages from non-migratable VMAs.
      
      This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page():
      
          #include <fcntl.h>
          #include <unistd.h>
          #include <stdio.h>
          #include <sys/mman.h>
          #include <numaif.h>
      
          #define SIZE 0x2000
      
          int foo;
      
          int main()
          {
              int fd;
              char *p;
              unsigned long mask = 2;
      
              fd = open("/dev/sg0", O_RDWR);
              p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
              /* Faultin pages */
              foo = p[0] + p[0x1000];
              mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT);
              return 0;
          }
      
      The only case when we can queue pages from such a VMA is MPOL_MF_STRICT
      plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for a VMA which has pages on the
      LRU, but whose gfp mask is not suitable for migration (see the
      mapping_gfp_mask() check in vma_migratable()).  That looks like a bug
      to me.
      
      Let's filter out non-migratable vma at start of queue_pages_test_walk()
      and go to queue_pages_pte_range() only if MPOL_MF_MOVE or
      MPOL_MF_MOVE_ALL flag is set.
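
      A sketch of the resulting filter (close to the patch; returning 1 from
      the test_walk callback makes the page walk skip the VMA, and some
      unrelated checks are elided):

          static int queue_pages_test_walk(unsigned long start, unsigned long end,
                                           struct mm_walk *walk)
          {
              struct vm_area_struct *vma = walk->vma;
              struct queue_pages *qp = walk->private;
              unsigned long flags = qp->flags;

              if (!vma_migratable(vma))
                  return 1;           /* never queue from such VMAs */

              /* ... MPOL_MF_STRICT range checks elided ... */

              if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
                  return 0;           /* descend into queue_pages_pte_range() */
              return 1;
          }
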
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 16 Jan 2016: 3 commits
  24. 15 Jan 2016: 1 commit
    • mm/mempolicy.c: convert the shared_policy lock to a rwlock · 4a8c7bb5
      Committed by Nathan Zimmer
      When running the SPECint_rate gcc benchmark on some very large boxes,
      it was noticed that the system was spending lots of time in
      mpol_shared_policy_lookup().  The gamess benchmark can also show it,
      and it is what I mostly used to chase down the issue, since I found
      its setup to be easier.
      
      To be clear the binaries were on tmpfs because of disk I/O requirements.
      We then used text replication to avoid icache misses and having all the
      copies banging on the memory where the instruction code resides.  This
      results in us hitting a bottleneck in mpol_shared_policy_lookup() since
      lookup is serialised by the shared_policy lock.
      
      I have only reproduced this on very large (3k+ cores) boxes.  The
      problem starts showing up at just a few hundred ranks getting worse
      until it threatens to livelock once it gets large enough.  For example
      on the gamess benchmark at 128 ranks this area consumes only ~1% of
      time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
      90%.
      
      To alleviate the contention in this area I converted the spinlock to an
      rwlock.  This allows a large number of lookups to happen simultaneously.
      The results were quite good, reducing this consumption at max ranks to
      around 2%.
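
      A sketch of the conversion (close to the post-patch code; sp_lookup()
      is the existing rb-tree helper in mm/mempolicy.c):

          struct shared_policy {
              struct rb_root root;
              rwlock_t lock;          /* was spinlock_t */
          };

          struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
                                                      unsigned long idx)
          {
              struct mempolicy *pol = NULL;
              struct sp_node *sn;

              if (!sp->root.rb_node)
                  return NULL;
              read_lock(&sp->lock);   /* readers no longer serialize each other */
              sn = sp_lookup(sp, idx, idx + 1);
              if (sn) {
                  mpol_get(sn->policy);
                  pol = sn->policy;
              }
              read_unlock(&sp->lock);
              return pol;
          }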
      
      [akpm@linux-foundation.org: tidy up code comments]
      Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>