1. 28 10月, 2016 9 次提交
  2. 25 10月, 2016 1 次提交
    • L
      mm: unexport __get_user_pages() · 0d731759
      Lorenzo Stoakes 提交于
      This patch unexports the low-level __get_user_pages() function.
      
      Recent refactoring of the get_user_pages* functions allow flags to be
      passed through get_user_pages() which eliminates the need for access to
      this function from its one user, kvm.
      
      We can see that the two calls to get_user_pages() which replace
      __get_user_pages() in kvm_main.c are equivalent by examining their call
      stacks:
      
        get_user_page_nowait():
          get_user_pages(start, 1, flags, page, NULL)
          __get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
      			    false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, start, 1,
      		     flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)
      
        check_user_page_hwpoison():
          get_user_pages(addr, 1, flags, NULL, NULL)
          __get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
      			    false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
      		     NULL, NULL)
      Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d731759
  3. 20 10月, 2016 1 次提交
  4. 19 10月, 2016 12 次提交
  5. 16 10月, 2016 1 次提交
    • D
      kprobes: Unpoison stack in jprobe_return() for KASAN · 9f7d416c
      Dmitry Vyukov 提交于
      I observed false KSAN positives in the sctp code, when
      sctp uses jprobe_return() in jsctp_sf_eat_sack().
      
      The stray 0xf4 in shadow memory are stack redzones:
      
      [     ] ==================================================================
      [     ] BUG: KASAN: stack-out-of-bounds in memcmp+0xe9/0x150 at addr ffff88005e48f480
      [     ] Read of size 1 by task syz-executor/18535
      [     ] page:ffffea00017923c0 count:0 mapcount:0 mapping:          (null) index:0x0
      [     ] flags: 0x1fffc0000000000()
      [     ] page dumped because: kasan: bad access detected
      [     ] CPU: 1 PID: 18535 Comm: syz-executor Not tainted 4.8.0+ #28
      [     ] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      [     ]  ffff88005e48f2d0 ffffffff82d2b849 ffffffff0bc91e90 fffffbfff10971e8
      [     ]  ffffed000bc91e90 ffffed000bc91e90 0000000000000001 0000000000000000
      [     ]  ffff88005e48f480 ffff88005e48f350 ffffffff817d3169 ffff88005e48f370
      [     ] Call Trace:
      [     ]  [<ffffffff82d2b849>] dump_stack+0x12e/0x185
      [     ]  [<ffffffff817d3169>] kasan_report+0x489/0x4b0
      [     ]  [<ffffffff817d31a9>] __asan_report_load1_noabort+0x19/0x20
      [     ]  [<ffffffff82d49529>] memcmp+0xe9/0x150
      [     ]  [<ffffffff82df7486>] depot_save_stack+0x176/0x5c0
      [     ]  [<ffffffff817d2031>] save_stack+0xb1/0xd0
      [     ]  [<ffffffff817d27f2>] kasan_slab_free+0x72/0xc0
      [     ]  [<ffffffff817d05b8>] kfree+0xc8/0x2a0
      [     ]  [<ffffffff85b03f19>] skb_free_head+0x79/0xb0
      [     ]  [<ffffffff85b0900a>] skb_release_data+0x37a/0x420
      [     ]  [<ffffffff85b090ff>] skb_release_all+0x4f/0x60
      [     ]  [<ffffffff85b11348>] consume_skb+0x138/0x370
      [     ]  [<ffffffff8676ad7b>] sctp_chunk_put+0xcb/0x180
      [     ]  [<ffffffff8676ae88>] sctp_chunk_free+0x58/0x70
      [     ]  [<ffffffff8677fa5f>] sctp_inq_pop+0x68f/0xef0
      [     ]  [<ffffffff8675ee36>] sctp_assoc_bh_rcv+0xd6/0x4b0
      [     ]  [<ffffffff8677f2c1>] sctp_inq_push+0x131/0x190
      [     ]  [<ffffffff867bad69>] sctp_backlog_rcv+0xe9/0xa20
      [ ... ]
      [     ] Memory state around the buggy address:
      [     ]  ffff88005e48f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [     ]  ffff88005e48f400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [     ] >ffff88005e48f480: f4 f4 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [     ]                    ^
      [     ]  ffff88005e48f500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [     ]  ffff88005e48f580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [     ] ==================================================================
      
      KASAN stack instrumentation poisons stack redzones on function entry
      and unpoisons them on function exit. If a function exits abnormally
      (e.g. with a longjmp like jprobe_return()), stack redzones are left
      poisoned. Later this leads to random KASAN false reports.
      
      Unpoison stack redzones in the frames we are going to jump over
      before doing actual longjmp in jprobe_return().
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Acked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: kasan-dev@googlegroups.com
      Cc: surovegin@google.com
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/1476454043-101898-1-git-send-email-dvyukov@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9f7d416c
  6. 13 10月, 2016 1 次提交
    • L
      Disable the __builtin_return_address() warning globally after all · ef6000b4
      Linus Torvalds 提交于
      This affectively reverts commit 377ccbb4 ("Makefile: Mute warning
      for __builtin_return_address(>0) for tracing only") because it turns out
      that it really isn't tracing only - it's all over the tree.
      
      We already also had the warning disabled separately for mm/usercopy.c
      (which this commit also removes), and it turns out that we will also
      want to disable it for get_lock_parent_ip(), that is used for at least
      TRACE_IRQFLAGS.  Which (when enabled) ends up being all over the tree.
      
      Steven Rostedt had a patch that tried to limit it to just the config
      options that actually triggered this, but quite frankly, the extra
      complexity and abstraction just isn't worth it.  We have never actually
      had a case where the warning is actually useful, so let's just disable
      it globally and not worry about it.
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef6000b4
  7. 12 10月, 2016 1 次提交
  8. 11 10月, 2016 3 次提交
    • E
      latent_entropy: Mark functions with __latent_entropy · 0766f788
      Emese Revfy 提交于
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at unpredictable
      times, or they have variable loops, each of which provide some level of
      latent entropy.
      Signed-off-by: NEmese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      0766f788
    • E
      gcc-plugins: Add latent_entropy plugin · 38addce8
      Emese Revfy 提交于
      This adds a new gcc plugin named "latent_entropy". It is designed to
      extract as much possible uncertainty from a running system at boot time as
      possible, hoping to capitalize on any possible variation in CPU operation
      (due to runtime data differences, hardware differences, SMP ordering,
      thermal timing variation, cache behavior, etc).
      
      At the very least, this plugin is a much more comprehensive example for
      how to manipulate kernel code using the gcc plugin internals.
      
      The need for very-early boot entropy tends to be very architecture or
      system design specific, so this plugin is more suited for those sorts
      of special cases. The existing kernel RNG already attempts to extract
      entropy from reliable runtime variation, but this plugin takes the idea to
      a logical extreme by permuting a global variable based on any variation
      in code execution (e.g. a different value (and permutation function)
      is used to permute the global based on loop count, case statement,
      if/then/else branching, etc).
      
      To do this, the plugin starts by inserting a local variable in every
      marked function. The plugin then adds logic so that the value of this
      variable is modified by randomly chosen operations (add, xor and rol) and
      random values (gcc generates separate static values for each location at
      compile time and also injects the stack pointer at runtime). The resulting
      value depends on the control flow path (e.g., loops and branches taken).
      
      Before the function returns, the plugin mixes this local variable into
      the latent_entropy global variable. The value of this global variable
      is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
      though it does not credit any bytes of entropy to the pool; the contents
      of the global are just used to mix the pool.
      
      Additionally, the plugin can pre-initialize arrays with build-time
      random contents, so that two different kernel builds running on identical
      hardware will not have the same starting values.
      Signed-off-by: NEmese Revfy <re.emese@gmail.com>
      [kees: expanded commit message and code comments]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      38addce8
    • A
      fix ITER_PIPE interaction with direct_IO · c3a69024
      Al Viro 提交于
      by making sure we call iov_iter_advance() on original
      iov_iter even if direct_IO (done on its copy) has returned 0.
      It's a no-op for old iov_iter flavours and does the right thing
      (== truncation of the stuff we'd allocated, but not filled) in
      ITER_PIPE case.  Failures (e.g. -EIO) get caught and dealt with
      by cleanup in generic_file_read_iter().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c3a69024
  9. 08 10月, 2016 11 次提交
    • A
      vfs: Remove {get,set,remove}xattr inode operations · fd50ecad
      Andreas Gruenbacher 提交于
      These inode operations are no longer used; remove them.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fd50ecad
    • J
      seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char · 75ba1d07
      Joe Perches 提交于
      Allow some seq_puts removals by taking a string instead of a single
      char.
      
      [akpm@linux-foundation.org: update vmstat_show(), per Joe]
      Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.comSigned-off-by: NJoe Perches <joe@perches.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75ba1d07
    • A
      proc: much faster /proc/vmstat · 68ba0326
      Alexey Dobriyan 提交于
      Every current KDE system has process named ksysguardd polling files
      below once in several seconds:
      
      	$ strace -e trace=open -p $(pidof ksysguardd)
      	Process 1812 attached
      	open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
      	open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
      	open("/proc/net/dev", O_RDONLY)         = 8
      	open("/proc/net/wireless", O_RDONLY)    = -1 ENOENT (No such file or directory)
      	open("/proc/stat", O_RDONLY)            = 8
      	open("/proc/vmstat", O_RDONLY)          = 8
      
      Hell knows what it is doing but speed up reading /proc/vmstat by 33%!
      
      Benchmark is open+read+close 1.000.000 times.
      
      			BEFORE
      $ perf stat -r 10 taskset -c 3 ./proc-vmstat
      
       Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):
      
            13146.768464      task-clock (msec)         #    0.960 CPUs utilized            ( +-  0.60% )
                      15      context-switches          #    0.001 K/sec                    ( +-  1.41% )
                       1      cpu-migrations            #    0.000 K/sec                    ( +- 11.11% )
                     104      page-faults               #    0.008 K/sec                    ( +-  0.57% )
          45,489,799,349      cycles                    #    3.460 GHz                      ( +-  0.03% )
           9,970,175,743      stalled-cycles-frontend   #   21.92% frontend cycles idle     ( +-  0.10% )
           2,800,298,015      stalled-cycles-backend    #   6.16% backend cycles idle       ( +-  0.32% )
          79,241,190,850      instructions              #    1.74  insn per cycle
                                                        #    0.13  stalled cycles per insn  ( +-  0.00% )
          17,616,096,146      branches                  # 1339.956 M/sec                    ( +-  0.00% )
             176,106,232      branch-misses             #    1.00% of all branches          ( +-  0.18% )
      
            13.691078109 seconds time elapsed                                          ( +-  0.03% )
            ^^^^^^^^^^^^
      
      			AFTER
      $ perf stat -r 10 taskset -c 3 ./proc-vmstat
      
       Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):
      
             8688.353749      task-clock (msec)         #    0.950 CPUs utilized            ( +-  1.25% )
                      10      context-switches          #    0.001 K/sec                    ( +-  2.13% )
                       1      cpu-migrations            #    0.000 K/sec
                     104      page-faults               #    0.012 K/sec                    ( +-  0.56% )
          30,384,010,730      cycles                    #    3.497 GHz                      ( +-  0.07% )
          12,296,259,407      stalled-cycles-frontend   #   40.47% frontend cycles idle     ( +-  0.13% )
           3,370,668,651      stalled-cycles-backend    #  11.09% backend cycles idle       ( +-  0.69% )
          28,969,052,879      instructions              #    0.95  insn per cycle
                                                        #    0.42  stalled cycles per insn  ( +-  0.01% )
           6,308,245,891      branches                  #  726.058 M/sec                    ( +-  0.00% )
             214,685,502      branch-misses             #    3.40% of all branches          ( +-  0.26% )
      
             9.146081052 seconds time elapsed                                          ( +-  0.07% )
             ^^^^^^^^^^^
      
      vsnprintf() is slow because:
      
      1. format_decode() is busy looking for format specifier: 2 branches
         per character (not in this case, but in others)
      
      2. approximately million branches while parsing format mini language
         and everywhere
      
      3.  just look at what string() does /proc/vmstat is good case because
         most of its content are strings
      
      Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.bySigned-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68ba0326
    • Z
      mm: remove unnecessary condition in remove_inode_hugepages · 72e2936c
      zhong jiang 提交于
      When the huge page is added to the page cahce (huge_add_to_page_cache),
      the page private flag will be cleared.  since this code
      (remove_inode_hugepages) will only be called for pages in the page
      cahce, PagePrivate(page) will always be false.
      
      The patch remove the code without any functional change.
      
      Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72e2936c
    • M
      mm: warn about allocations which stall for too long · 63f53dea
      Michal Hocko 提交于
      Currently we do warn only about allocation failures but small
      allocations are basically nofail and they might loop in the page
      allocator for a long time.  Especially when the reclaim cannot make any
      progress - e.g.  GFP_NOFS cannot invoke the oom killer and rely on a
      different context to make a forward progress in case there is a lot
      memory used by filesystems.
      
      Give us at least a clue when something like this happens and warn about
      allocations which take more than 10s.  Print the basic allocation
      context information along with the cumulative time spent in the
      allocation as well as the allocation stack.  Repeat the warning after
      every 10 seconds so that we know that the problem is permanent rather
      than ephemeral.
      
      Link: http://lkml.kernel.org/r/20160929084407.7004-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63f53dea
    • M
      mm: consolidate warn_alloc_failed users · 7877cdcc
      Michal Hocko 提交于
      warn_alloc_failed is currently used from the page and vmalloc
      allocators.  This is a good reuse of the code except that vmalloc would
      appreciate a slightly different warning message.  This is already
      handled by the fmt parameter except that
      
        "%s: page allocation failure: order:%u, mode:%#x(%pGg)"
      
      is printed anyway.  This might be quite misleading because it might be a
      vmalloc failure which leads to the warning while the page allocator is
      not the culprit here.  Fix this by always using the fmt string and only
      print the context that makes sense for the particular context (e.g.
      order makes only very little sense for the vmalloc context).
      
      Rename the function to not miss any user and also because a later patch
      will reuse it also for !failure cases.
      
      Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7877cdcc
    • W
      vfs,mm: fix a dead loop in truncate_inode_pages_range() · c2a9737f
      Wei Fang 提交于
      We triggered a deadloop in truncate_inode_pages_range() on 32 bits
      architecture with the test case bellow:
      
      	...
      	fd = open();
      	write(fd, buf, 4096);
      	preadv64(fd, &iovec, 1, 0xffffffff000);
      	ftruncate(fd, 0);
      	...
      
      Then ftruncate() will not return forever.
      
      The filesystem used in this case is ubifs, but it can be triggered on
      many other filesystems.
      
      When preadv64() is called with offset=0xffffffff000, a page with
      index=0xffffffff will be added to the radix tree of ->mapping.  Then
      this page can be found in ->mapping with pagevec_lookup().  After that,
      truncate_inode_pages_range(), which is called in ftruncate(), will fall
      into an infinite loop:
      
       - find a page with index=0xffffffff, since index>=end, this page won't
         be truncated
      
       - index++, and index become 0
      
       - the page with index=0xffffffff will be found again
      
      The data type of index is unsigned long, so index won't overflow to 0 on
      64 bits architecture in this case, and the dead loop won't happen.
      
      Since truncate_inode_pages_range() is executed with holding lock of
      inode->i_rwsem, any operation related with this lock will be blocked,
      and a hung task will happen, e.g.:
      
        INFO: task truncate_test:3364 blocked for more than 120 seconds.
        ...
           call_rwsem_down_write_failed+0x17/0x30
           generic_file_write_iter+0x32/0x1c0
           ubifs_write_iter+0xcc/0x170
           __vfs_write+0xc4/0x120
           vfs_write+0xb2/0x1b0
           SyS_write+0x46/0xa0
      
      The page with index=0xffffffff added to ->mapping is useless.  Fix this
      by checking the read position before allocating pages.
      
      Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.comSigned-off-by: NWei Fang <fangwei1@huawei.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2a9737f
    • Y
      mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGE · 461a7184
      Yisheng Xie 提交于
      Avoid making ifdef get pretty unwieldy if many ARCHs support gigantic
      page.  No functional change with this patch.
      
      Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.comSigned-off-by: NYisheng Xie <xieyisheng1@huawei.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      461a7184
    • M
      oom: print nodemask in the oom report · 82e7d3ab
      Michal Hocko 提交于
      We have received a hard to explain oom report from a customer.  The oom
      triggered regardless there is a lot of free memory:
      
        PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
        PoolThread cpuset=/ mems_allowed=0-7
        Pid: 30055, comm: PoolThread Tainted: G           E X 3.0.101-80-default #1
        Call Trace:
          dump_trace+0x75/0x300
          dump_stack+0x69/0x6f
          dump_header+0x8e/0x110
          oom_kill_process+0xa6/0x350
          out_of_memory+0x2b7/0x310
          __alloc_pages_slowpath+0x7dd/0x820
          __alloc_pages_nodemask+0x1e9/0x200
          alloc_pages_vma+0xe1/0x290
          do_anonymous_page+0x13e/0x300
          do_page_fault+0x1fd/0x4c0
          page_fault+0x25/0x30
        [...]
        active_anon:1135959151 inactive_anon:1051962 isolated_anon:0
         active_file:13093 inactive_file:222506 isolated_file:0
         unevictable:262144 dirty:2 writeback:0 unstable:0
         free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308
         mapped:261139 shmem:166297 pagetables:2228282 bounce:0
        [...]
        Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
        lowmem_reserve[]: 0 2892 775542 775542
        Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 772650 772650
        Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
      
      So all nodes but Node 0 have a lot of free memory which should suggest
      that there is an available memory especially when mems_allowed=0-7.  One
      could speculate that a massive process has managed to terminate and free
      up a lot of memory while racing with the above allocation request.
      Although this is highly unlikely it cannot be ruled out.
      
      A further debugging, however shown that the faulting process had
      mempolicy (not cpuset) to bind to Node 0.  We cannot see that
      information from the report though.  mems_allowed turned out to be more
      confusing than really helpful.
      
      Fix this by always priting the nodemask.  It is either mempolicy mask
      (and non-null) or the one defined by the cpusets.  The new output for
      the above oom report would be
      
        PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0
      
      This patch doesn't touch show_mem and the node filtering based on the
      cpuset node mask because mempolicy is always a subset of cpusets and
      seeing the full cpuset oom context might be helpful for tunning more
      specific mempolicies inside cpusets (e.g.  when they turn out to be too
      restrictive).  To prevent from ugly ifdefs the mask is printed even for
      !NUMA configurations but this should be OK (a single node will be
      printed).
      
      Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NSellami Abdelkader <abdelkader.sellami@sap.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Sellami Abdelkader <abdelkader.sellami@sap.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82e7d3ab
    • K
    • A
      mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb · 8f26e0b1
      Andrea Arcangeli 提交于
      The old code was always doing:
      
         vma->vm_end = next->vm_end
         vma_rb_erase(next) // in __vma_unlink
         vma->vm_next = next->vm_next // in __vma_unlink
         next = vma->vm_next
         vma_gap_update(next)
      
      The new code still does the above for remove_next == 1 and 2, but for
      remove_next == 3 it has been changed and it does:
      
         next->vm_start = vma->vm_start
         vma_rb_erase(vma) // in __vma_unlink
         vma_gap_update(next)
      
      In the latter case, while unlinking "vma", validate_mm_rb() is told to
      ignore "vma" that is being removed, but next->vm_start was reduced
      instead. So for the new case, to avoid the false positive from
      validate_mm_rb, it should be "next" that is ignored when "vma" is
      being unlinked.
      
      "vma" and "next" in the above comment, considered pre-swap().
      
      Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Tested-by: NShaun Tancheff <shaun.tancheff@seagate.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Vorlicek <janvorli@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f26e0b1