1. 20 December 2014, 4 commits
  2. 19 December 2014, 6 commits
  3. 17 December 2014, 3 commits
  4. 14 December 2014, 27 commits
    • aio: Make it possible to remap aio ring · e4a0d3e7
      Pavel Emelyanov committed
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump a task's
      state and restore it back exactly as it was.  One of the bits of a
      task's state is the rings set up with the io_setup() call.  There is
      (almost) no problem dumping them, but there is a problem restoring
      them: if I dump a task with an aio ring originally mapped at address
      A, I want to restore it back at exactly the same address A.
      Unfortunately, io_setup() does not allow for that -- it mmaps the ring
      at whatever place mm finds appropriate (it calls do_mmap_pgoff() with
      a zero address and without the MAP_FIXED flag).
      
      To make restore possible, I'm going to mremap() the freshly created
      ring to the address A (at which it was seen before the dump).  The
      problem is that the ring's virtual address is passed back to user
      space as the context ID, and this ID is then used as the search key by
      all the other io_foo() calls.  Reworking this ID to be just some
      integer doesn't seem to work, as this value is already used by libaio
      as a pointer through which the library accesses memory for aio
      meta-data.
      
      So, to make restore work we need to make sure that
      
      a) the ring is mapped at the desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on the aio region update
      the kioctx's user_id and mmap_base values.
      
      This brings up the 2nd issue I mentioned at the beginning of this mail.
      If (regardless of the C/R dances I do) someone creates an io context
      with io_setup(), then mremap()-s the ring and then destroys the context,
      the kill_ioctx() routine will call munmap() on the wrong (old) address.
      This will result in a) the aio ring remaining in memory and b) some
      other vma getting unexpectedly unmapped.
      
      What do you think?
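      
      For context, a minimal userspace sketch of the restore flow this patch
      enables; the helper name is made up here, error handling is simplified,
      and obtaining the ring's real mapping size is assumed to happen
      elsewhere:
      
      #define _GNU_SOURCE
      #include <linux/aio_abi.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      /* Hypothetical helper: recreate an aio ring and move it back to the
       * address A recorded at dump time.  ring_len must be the ring's real
       * (page-aligned) mapping size. */
      static int restore_ring_at(void *A, size_t ring_len, aio_context_t *out)
      {
              aio_context_t ctx = 0;
      
              if (syscall(__NR_io_setup, 128, &ctx))
                      return -1;
              /* ctx is the ring's current mapping address; move it to A.
               * After this patch the kernel updates kioctx->user_id too. */
              if (mremap((void *)ctx, ring_len, ring_len,
                         MREMAP_MAYMOVE | MREMAP_FIXED, A) == MAP_FAILED)
                      return -1;
              *out = (aio_context_t)(unsigned long)A;
              return 0;
      }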
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      e4a0d3e7
    • percpu: remove __get_cpu_var and __raw_get_cpu_var macros · 6c51ec4d
      Christoph Lameter committed
      No user is left in the kernel source tree.  Therefore we can drop the
      definitions.
      
      This is the final merge of the transition away from __get_cpu_var.  After
      this patch the kernel will not build if anyone uses __get_cpu_var.
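      
      As a hedged reminder (a sketch, not quoted from this commit), per-cpu
      accesses that used __get_cpu_var are now written with the this_cpu
      operations:
      
      DEFINE_PER_CPU(int, hits);
      
      static void count_hit(void)
      {
              /* Before (no longer builds after this patch):
               *     __get_cpu_var(hits)++;
               */
              this_cpu_inc(hits);             /* increment this CPU's copy */
      }
      
      static int *local_hits(void)
      {
              /* Before: &__get_cpu_var(hits); now: */
              return this_cpu_ptr(&hits);     /* pointer to this CPU's copy */
      }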
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c51ec4d
    • fsnotify: remove destroy_list from fsnotify_mark · 37d469e7
      Jan Kara committed
      destroy_list is used to track marks which still need to wait for the
      srcu period to end before they can be freed.  However, by the time a
      mark is added to destroy_list it is no longer in the group's list of
      marks, and thus we can reuse fsnotify_mark->g_list for queueing onto
      destroy_list.  This saves two pointers for each fsnotify_mark.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37d469e7
    • fsnotify: unify inode and mount marks handling · 0809ab69
      Jan Kara committed
      There's a lot of common code in inode and mount marks handling.  Factor it
      out to a common helper function.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0809ab69
    • ipc/msg: increase MSGMNI, remove scaling · 0050ee05
      Manfred Spraul committed
      SysV can be abused to allocate locked kernel memory.  For most systems, a
      small limit doesn't make sense; see the discussion with regard to SHMMAX.
      
      Therefore: increase MSGMNI to the maximum supported.
      
      And: If we ignore the risk of locking too much memory, then an automatic
      scaling of MSGMNI doesn't make sense.  Therefore the logic can be removed.
      
      The code preserves auto_msgmni to avoid breaking user space applications
      that expect the value to exist.
      
      Notes:
      1) If an administrator must limit the memory allocations, then he can set
      MSGMNI as necessary.
      
      Or he can disable sysv entirely (as e.g. done by Android).
      
      2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
      to control latency vs. throughput:
      If MSGMNB is large, then msgsnd() just returns and more messages can be queued
      before a task switch to a task that calls msgrcv() is forced.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0050ee05
    • ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM · e843e7d2
      Manfred Spraul committed
      a)
      
      SysV can be abused to allocate locked kernel memory.  For most systems, a
      small limit doesn't make sense; see the discussion with regard to SHMMAX.
      
      Therefore: Increase the sysv sem limits so that all known applications
      will work with these defaults.
      
      b)
      
      With regard to the maximum supported values:
      some of the specified hard limits are no longer correct, therefore the
      patch also updates the documentation.
      
      - SEMMNI must stay below IPCMNI, which is 32768.
        As for SHMMAX: Stay a bit below this limit.
      
      - SEMMSL was limited to 8k, to ensure that the kmalloc for the kernel array
        was limited to 16 kB (order=2)
      
        This doesn't apply anymore:
         - the allocation size isn't sizeof(short)*nsems anymore.
         - ipc_alloc falls back to vmalloc
      
      - SEMOPM should stay below 1000, to limit the kmalloc in semtimedop() to an
        order=1 allocation.
        Therefore: Leave it at 500 (order=0 allocation).
      
      Note:
      If an administrator must limit the memory allocations, then he can set the
      values as necessary.
      
      Or he can disable sysv entirely (as e.g. done by Android).
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e843e7d2
    • fault-inject: add ratelimit option · 6adc4a22
      Dmitry Monakhov committed
      The current debug levels are not optimal.  In particular, if one wants
      to provoke a large number of faults (e.g. a broken-device simulator),
      then any verbose level will produce a huge number of identical log
      messages.  Let's add a ratelimit parameter for that purpose.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Acked-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6adc4a22
    • ratelimit: add initialization macro · 89e3f909
      Dmitry Monakhov committed
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e3f909
    • syscalls: implement execveat() system call · 51f39a1f
      David Drysdale committed
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
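      
      A hedged userspace sketch of the fexecve(3)-style usage described
      above; it assumes kernel headers that export __NR_execveat (there was
      no glibc wrapper at the time):
      
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      int main(void)
      {
              char *argv[] = { "true", NULL };
              char *envp[] = { NULL };
              int fd = open("/bin/true", O_RDONLY | O_CLOEXEC);
      
              if (fd < 0)
                      return 1;
              /* Empty filename + AT_EMPTY_PATH: execute the file the fd
               * refers to, i.e. a /proc-free fexecve(). */
              syscall(__NR_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
              perror("execveat");     /* reached only on failure */
              return 1;
      }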
      
      Based on patches by Meredydd Luff.
      Signed-off-by: David Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
    • memcg: fix possible use-after-free in memcg_kmem_get_cache() · 8135be5a
      Vladimir Davydov committed
      Suppose task @t that belongs to a memory cgroup @memcg is going to
      allocate an object from a kmem cache @c.  The copy of @c corresponding to
      @memcg, @mc, is empty.  Then if kmem_cache_alloc races with the memory
      cgroup destruction we can access the memory cgroup's copy of the cache
      after it was destroyed:
      
      CPU0				CPU1
      ----				----
      [ current=@t
        @mc->memcg_params->nr_pages=0 ]
      
      kmem_cache_alloc(@c):
        call memcg_kmem_get_cache(@c);
        proceed to allocation from @mc:
          alloc a page for @mc:
            ...
      
      				move @t from @memcg
      				destroy @memcg:
      				  mem_cgroup_css_offline(@memcg):
      				    memcg_unregister_all_caches(@memcg):
      				      kmem_cache_destroy(@mc)
      
          add page to @mc
      
      We could fix this issue by taking a reference to a per-memcg cache, but
      that would require adding a per-cpu reference counter to per-memcg caches,
      which would look cumbersome.
      
      Instead, let's take a reference to the memory cgroup, which already has
      a per-cpu reference counter, at the beginning of kmem_cache_alloc and
      drop it at the end, and move per-memcg cache destruction from css
      offline to css free.  As a side effect, per-memcg caches will be
      destroyed not one by one, but all at once when the last page accounted
      to the memory cgroup is freed.  That doesn't sound like a high price to
      pay for code readability, though.
      
      Note, this patch does add some overhead to the kmem_cache_alloc hot path,
      but it is pretty negligible - it's just a function call plus a per cpu
      counter decrement, which is comparable to what we already have in
      memcg_kmem_get_cache.  Besides, it's only relevant if there are memory
      cgroups with kmem accounting enabled.  I don't think we can find a way to
      handle this race w/o it, because alloc_page called from kmem_cache_alloc
      may sleep so we can't flush all pending kmallocs w/o reference counting.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8135be5a
    • oom: don't assume that a coredumping thread will exit soon · d003f371
      Oleg Nesterov committed
      oom_kill.c assumes that a PF_EXITING task will exit and free its memory
      soon.  This is wrong in many ways, and one important case is the
      coredump: a task can sleep in exit_mm() "forever" while the coredumping
      sub-thread may need more memory.
      
      Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into
      account; a new trivial helper is added for that (see the sketch below).
      
      Note: this is only the first step; this patch doesn't try to solve
      other problems.  The SIGNAL_GROUP_COREDUMP check is obviously racy: a
      task can join a coredump after it was already observed in the
      PF_EXITING state, so TIF_MEMDIE (which also blocks the oom-killer) can
      still be wrongly set.  fatal_signal_pending() can be true because of
      SIGNAL_GROUP_COREDUMP, so out_of_memory() and
      mem_cgroup_out_of_memory() shouldn't blindly trust it.  And even the
      name/usage of the new helper is confusing: an exiting thread can only
      free its ->mm if it is the only/last task in the thread group.
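      
      A hedged sketch of the check described above; the helper name and its
      exact placement are assumptions made here, not quoted from the patch:
      
      /* A PF_EXITING task only counts as "about to free its memory" when no
       * coredump is in flight for its thread group. */
      static inline bool task_will_free_mem(struct task_struct *task)
      {
              return (task->flags & PF_EXITING) &&
                     !(task->signal->flags & SIGNAL_GROUP_COREDUMP);
      }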
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d003f371
    • mm: vmscan: invoke slab shrinkers from shrink_zone() · 6b4f7799
      Johannes Weiner committed
      The slab shrinkers are currently invoked from the zonelist walkers in
      kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
      eligible LRU pages and assemble a nodemask to pass to NUMA-aware
      shrinkers, which then again have to walk over the nodemask.  This is
      redundant code, extra runtime work, and fairly inaccurate when it comes to
      the estimation of actually scannable LRU pages.  The code duplication will
      only get worse when making the shrinkers cgroup-aware and requiring them
      to have out-of-band cgroup hierarchy walks as well.
      
      Instead, invoke the shrinkers from shrink_zone(), which is where all
      reclaimers end up, to avoid this duplication.
      
      Take the count for eligible LRU pages out of get_scan_count(), which
      considers many more factors than just the availability of swap space, like
      zone_reclaimable_pages() currently does.  Accumulate the number over all
      visited lruvecs to get the per-zone value.
      
      Some nodes have multiple zones due to memory addressing restrictions.  To
      avoid putting too much pressure on the shrinkers, only invoke them once
      for each such node, using the class zone of the allocation as the pivot
      zone.
      
      For now, this integrates the slab shrinking better into the reclaim logic
      and gets rid of duplicative invocations from kswapd, direct reclaim, and
      zone reclaim.  It also prepares for cgroup-awareness, allowing
      memcg-capable shrinkers to be added at the lruvec level without much
      duplication of both code and runtime work.
      
      This changes kswapd behavior, which used to invoke the shrinkers for each
      zone, but with scan ratios gathered from the entire node, resulting in
      meaningless pressure quantities on multi-zone nodes.
      
      Zone reclaim behavior also changes.  It used to shrink slabs until the
      same amount of pages were shrunk as were reclaimed from the LRUs.  Now it
      merely invokes the shrinkers once with the zone's scan ratio, which makes
      the shrinkers go easier on caches that implement aging and would prefer
      feeding back pressure from recently used slab objects to unused LRU pages.
      
      [vdavydov@parallels.com: assure class zone is populated]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b4f7799
    • mm,vmacache: count number of system-wide flushes · f5f302e2
      Davidlohr Bueso committed
      These flushes deal with sequence number overflows, such as for long lived
      threads.  These are rare, but interesting from a debugging PoV.  As such,
      display the number of flushes when vmacache debugging is enabled.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5f302e2
    • mm/page_owner: keep track of page owners · 48c96a36
      Joonsoo Kim committed
      This is the page owner tracking code, which was introduced quite a
      while ago.  It has been sitting in Andrew's tree, but nobody tried to
      upstream it, so it remained as is.  Our company actively uses this
      feature to debug memory leaks or to find memory hoggers, so I decided
      to upstream it.
      
      This functionality helps us to know who allocated a page.  When
      allocating a page, we store some information about the allocation in
      extra memory.  Later, if we need to know the status of all pages, we
      can retrieve and analyze it from this stored information.
      
      In a previous version of this feature, the extra memory was statically
      defined in struct page; in this version, the extra memory is allocated
      outside of struct page.  This lets us turn the feature on/off at boot
      time without considerable memory waste.
      
      Although we already have tracepoints for page allocation/free, using
      them to analyze page ownership is rather complex.  We would need to
      enlarge the trace buffer to prevent it wrapping before the userspace
      program is launched, and that program would have to continually dump
      out the trace buffer for later analysis, which would perturb system
      behaviour far more than just keeping the information in memory, so it
      is bad for debugging.
      
      Moreover, the page_owner feature can be used for various further
      purposes.  For example, it is used for the fragmentation statistics
      implemented in this patch, and I also plan to implement some CMA
      failure debugging feature on top of this interface.
      
      I'd like to give credit to all the developers who contributed to this
      feature, but it's not easy because I don't know the exact history.
      Sorry about that.  Below are the people who have a "Signed-off-by" in
      the patches in Andrew's tree.
      
      Contributor:
      Alexander Nyberg <alexn@dsv.su.se>
      Mel Gorman <mgorman@suse.de>
      Dave Hansen <dave@linux.vnet.ibm.com>
      Minchan Kim <minchan@kernel.org>
      Michal Nazarewicz <mina86@mina86.com>
      Andrew Morton <akpm@linux-foundation.org>
      Jungsoo Son <jungsoo.son@lge.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48c96a36
    • stacktrace: introduce snprint_stack_trace for buffer output · 9a92a6ce
      Joonsoo Kim committed
      The current stacktrace code only has a function for console output.
      page_owner, which will be introduced in the following patch, needs to
      print the stacktrace into a buffer in its own output format, so a new
      function, snprint_stack_trace(), is needed.
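      
      A hedged usage sketch; the signature shown is an assumption based on
      the description above, so consult the actual header:
      
      static void dump_trace_to_buf(char *buf, size_t size)
      {
              unsigned long entries[16];
              struct stack_trace trace = {
                      .max_entries = ARRAY_SIZE(entries),
                      .entries     = entries,
                      .skip        = 2,       /* skip the capture helpers */
              };
      
              save_stack_trace(&trace);
              snprint_stack_trace(buf, size, &trace, 0);  /* 0: no indent */
      }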
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a92a6ce
    • mm/debug-pagealloc: make debug-pagealloc boottime configurable · 031bc574
      Joonsoo Kim committed
      We have now prepared the ground for avoiding debug-pagealloc at boot
      time.  So introduce a new kernel parameter to disable debug-pagealloc
      at boot time, and make the related functions no-ops in that case.
      
      The only non-intuitive part is the change to the guard page functions.
      Because guard pages are only effective if debug-pagealloc is enabled,
      turning them off together with debug-pagealloc is the reasonable thing
      to do.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      031bc574
    • mm/debug-pagealloc: prepare boottime configurable on/off · e30825f1
      Joonsoo Kim committed
      Until now, debug-pagealloc needed extra flags in struct page, so the
      whole kernel had to be recompiled whenever we decided to use it.  This
      is really painful, because recompiling takes time and sometimes a
      rebuild is not possible due to third-party modules that depend on
      struct page.  So, in many cases, we couldn't use this useful feature.
      
      Now we have the page extension feature, which allows us to store extra
      flags outside of struct page.  It gets rid of the third-party module
      issue mentioned above, and it allows us to determine at boot time
      whether we need the extra memory for the page extension.  With these
      properties, a kernel built with CONFIG_DEBUG_PAGEALLOC can avoid using
      debug-pagealloc at boot time with low computational overhead.  This
      will help our development process greatly.
      
      This patch is the preparation step towards that goal.  debug-pagealloc
      originally used an extra field of struct page; after this patch it will
      use a field of struct page_ext instead.  Because the memory for
      page_ext is allocated later than the page allocator's initialization
      under CONFIG_SPARSEMEM, the debug-pagealloc feature must be disabled
      temporarily until page_ext is initialized.  This patch implements that.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e30825f1
    • mm/page_ext: resurrect struct page extending code for debugging · eefa864b
      Joonsoo Kim committed
      When we debug something, we'd like to attach some information to every
      page.  For this purpose, we sometimes modify struct page itself.  But
      this has drawbacks.  First, it requires a recompile, which makes us
      hesitate to use this powerful debug feature, so the development process
      is slowed down.  Second, sometimes it is impossible to rebuild the
      kernel due to third-party module dependencies.  Third, system behaviour
      can be largely different after a recompile, because it changes the size
      of struct page considerably and this structure is accessed by every
      part of the kernel.  Keeping struct page as it is makes it easier to
      reproduce an erroneous situation.
      
      This feature is intended to overcome the problems mentioned above.  It
      allocates memory for the extended per-page data in a separate place
      rather than in struct page itself, and this memory can be accessed
      through the accessor functions provided by this code.  During the boot
      process, it checks whether the allocation of this huge chunk of memory
      is needed at all; if not, it avoids allocating the memory entirely.
      With this advantage, we can include the feature in the kernel by
      default, avoiding rebuilds and solving the related problems.
      
      Until now, memcg used this technique.  But memcg has since decided to
      embed its variable in struct page itself, and its code for extending
      struct page has been removed.  I'd like to use this code to develop
      debug features, so this patch resurrects it.
      
      To help this work well, the patch introduces two callbacks for clients.
      One is the need callback, which is mandatory if the user wants to avoid
      useless memory allocation at boot time.  The other, the init callback,
      is optional and is used to do proper initialization after the memory is
      allocated.  A detailed explanation of the purpose of these functions is
      in the code comments; please refer to them.
      
      Everything else is the same as the previous extension code in memcg.
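      
      A hedged sketch of what a client registration could look like, assuming
      the two callbacks are collected in a page_ext operations structure; the
      names below are illustrative, not quoted from the patch:
      
      static bool my_debug_enabled __read_mostly;
      
      static bool my_debug_need(void)
      {
              /* Called at boot: only reserve the extra per-page memory if
               * the feature was actually requested. */
              return my_debug_enabled;
      }
      
      static void my_debug_init(void)
      {
              /* Called once the page_ext memory exists; do setup here. */
      }
      
      struct page_ext_operations my_debug_ops = {
              .need = my_debug_need,
              .init = my_debug_init,
      };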
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eefa864b
    • mm, gfp: escalatedly define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE · 2d48366b
      Jianyu Zhan committed
      GFP_USER, GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE are defined as an
      escalating hierarchy, as their names also imply:
      
      GFP_USER                                  = GFP_USER
      GFP_USER + __GFP_HIGHMEM                  = GFP_HIGHUSER
      GFP_USER + __GFP_HIGHMEM + __GFP_MOVABLE  = GFP_HIGHUSER_MOVABLE
      
      So just define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE in this escalating
      fashion to reflect that fact.  It also makes the definitions clearer
      and textually warns against any future break-up of this escalating
      relationship.
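      
      A hedged sketch of the cascaded definitions this describes; the exact
      GFP_USER flag set is quoted from memory and may differ from gfp.h:
      
      #define GFP_USER  (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
      #define GFP_HIGHUSER            (GFP_USER | __GFP_HIGHMEM)
      #define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)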
      Signed-off-by: Jianyu Zhan <jianyu.zhan@emc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d48366b
    • include/linux/kmemleak.h: needs slab.h · 66f2ca7e
      Andrew Morton committed
      include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive':
      include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function)
      
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66f2ca7e
    • mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache() · 056b7cce
      Zhang Zhen committed
      The gfp was passed in but never used in this function.
      Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      056b7cce
    • mm: move swp_entry_t definition to include/linux/mm_types.h · bd6dace7
      Tejun Heo committed
      swp_entry_t being defined in include/linux/swap.h instead of
      include/linux/mm_types.h causes cyclic include dependency later when
      include/linux/page_cgroup.h is included from writeback path.  Move the
      definition to include/linux/mm_types.h.
      
      While at it, reformat the comment above it.
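      
      For reference, a sketch of the definition being moved (an
      arch-independent opaque value; quoted from memory, not from the patch):
      
      typedef struct {
              unsigned long val;
      } swp_entry_t;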
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd6dace7
    • memcg: turn memcg_kmem_skip_account into a bit field · 6f185c29
      Vladimir Davydov committed
      It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on
      the task_struct.
      
      Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer
      to set/clear the flag inline.  As for the lengthy comment on the
      helpers, which this patch removes as well, we already have a compact
      yet accurate explanation in memcg_schedule_cache_create; there is no
      need for yet another one.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f185c29
    • lib: bitmap: add alignment offset for bitmap_find_next_zero_area() · 5e19b013
      Michal Nazarewicz committed
      Add a bitmap_find_next_zero_area_off() function which works like
      bitmap_find_next_zero_area() except that it allows an offset to be
      specified when alignment is checked.  This lets the caller request a
      bit such that its number plus the offset is aligned according to the
      mask.
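      
      A hedged usage sketch, assuming the new function takes the usual
      align_mask followed by the new align_offset parameter:
      
      /* Find 8 clear bits whose bit number, offset by the (possibly
       * unaligned) base PFN the bitmap starts at, is 16-aligned. */
      static unsigned long find_region(unsigned long *map, unsigned long size,
                                       unsigned long base_pfn)
      {
              return bitmap_find_next_zero_area_off(map, size, 0, 8,
                                                    15,            /* align to 16 */
                                                    base_pfn & 15); /* offset */
      }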
      
      [gregory.0xf0@gmail.com: Retrieved from https://patchwork.linuxtv.org/patch/6254/ and updated documentation]
      Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kukjin Kim <kgene.kim@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e19b013
    • mm/rmap: share the i_mmap_rwsem · 3dec0ba0
      Davidlohr Bueso committed
      Similarly to the anon memory counterpart, we can share the mapping's
      lock ownership, as the interval tree is not modified when doing the
      walk; only the file page is.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dec0ba0
    • mm: convert i_mmap_mutex to rwsem · c8c06efa
      Davidlohr Bueso committed
      The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
      similar data, one for file backed pages and the other for anon memory.  To
      this end, this lock can also be a rwsem.  In addition, there are some
      important opportunities to share the lock when there are no tree
      modifications.
      
      This conversion is straightforward.  For now, all users take the write
      lock.
      
      [sfr@canb.auug.org.au: update fremap.c]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8c06efa
    • mm,fs: introduce helpers around the i_mmap_mutex · 8b28f621
      Davidlohr Bueso committed
      This series is a continuation of the conversion of the i_mmap_mutex to
      rwsem, following what we have for the anon memory counterpart.  With
      Hugh's feedback from the first iteration.
      
      Ultimately, the most obvious paths that require exclusive ownership of
      the lock are those that modify the VMA interval tree, via the
      vma_interval_tree_insert() and vma_interval_tree_remove() families.
      Cases such as unmapping, where the pte contents are changed but the
      tree remains untouched, should make it safe to share the i_mmap_rwsem.
      
      As such, the code itself is straightforward; however, the devil is very
      much in the details.  While it has been tested on a number of workloads
      without anything exploding, I would not be surprised if there are some
      less documented/known assumptions about the lock that could suffer from
      these changes.  Or maybe I'm just missing something, but either way I
      believe it's at the point where it could use more eyes and hopefully
      some time in linux-next.
      
      Because the lock type conversion is the heart of this patchset,
      it's worth noting a few comparisons between mutex vs rwsem (xadd):
      
        (i) Same size, no extra footprint.
      
        (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
             exclusive lock ownership.
      
        (iii) Both can be slightly unfair wrt exclusive ownership, with
              writer lock stealing properties, not necessarily respecting
              FIFO order for granting the lock when contended.
      
        (iv) Mutexes can be slightly faster than rwsems when
             the lock is non-contended.
      
        (v) Both suck at performance for debug (slowpaths), which
            shouldn't matter anyway.
      
      Sharing the lock is obviously beneficial, and sem writer ownership is
      close enough to mutexes.  The biggest winner of these changes is
      migration.
      
      As for concrete numbers, the following performance results are for a
      4-socket 60-core IvyBridge-EX with 130Gb of RAM.
      
      Both the alltests and disk (xfs+ramdisk) workloads of the aim7 suite do
      quite well with this set, with a steady ~60% throughput (jpm) increase
      for alltests and up to ~30% for disk at high amounts of concurrency.
      Lower counts of workload users (< 100) do not show much difference at
      all, so at least there are no regressions.
      
                          3.18-rc1            3.18-rc1-i_mmap_rwsem
      alltests-100     17918.72 (  0.00%)    28417.97 ( 58.59%)
      alltests-200     16529.39 (  0.00%)    26807.92 ( 62.18%)
      alltests-300     16591.17 (  0.00%)    26878.08 ( 62.00%)
      alltests-400     16490.37 (  0.00%)    26664.63 ( 61.70%)
      alltests-500     16593.17 (  0.00%)    26433.72 ( 59.30%)
      alltests-600     16508.56 (  0.00%)    26409.20 ( 59.97%)
      alltests-700     16508.19 (  0.00%)    26298.58 ( 59.31%)
      alltests-800     16437.58 (  0.00%)    26433.02 ( 60.81%)
      alltests-900     16418.35 (  0.00%)    26241.61 ( 59.83%)
      alltests-1000    16369.00 (  0.00%)    26195.76 ( 60.03%)
      alltests-1100    16330.11 (  0.00%)    26133.46 ( 60.03%)
      alltests-1200    16341.30 (  0.00%)    26084.03 ( 59.62%)
      alltests-1300    16304.75 (  0.00%)    26024.74 ( 59.61%)
      alltests-1400    16231.08 (  0.00%)    25952.35 ( 59.89%)
      alltests-1500    16168.06 (  0.00%)    25850.58 ( 59.89%)
      alltests-1600    16142.56 (  0.00%)    25767.42 ( 59.62%)
      alltests-1700    16118.91 (  0.00%)    25689.58 ( 59.38%)
      alltests-1800    16068.06 (  0.00%)    25599.71 ( 59.32%)
      alltests-1900    16046.94 (  0.00%)    25525.92 ( 59.07%)
      alltests-2000    16007.26 (  0.00%)    25513.07 ( 59.38%)
      
      disk-100          7582.14 (  0.00%)     7257.48 ( -4.28%)
      disk-200          6962.44 (  0.00%)     7109.15 (  2.11%)
      disk-300          6435.93 (  0.00%)     6904.75 (  7.28%)
      disk-400          6370.84 (  0.00%)     6861.26 (  7.70%)
      disk-500          6353.42 (  0.00%)     6846.71 (  7.76%)
      disk-600          6368.82 (  0.00%)     6806.75 (  6.88%)
      disk-700          6331.37 (  0.00%)     6796.01 (  7.34%)
      disk-800          6324.22 (  0.00%)     6788.00 (  7.33%)
      disk-900          6253.52 (  0.00%)     6750.43 (  7.95%)
      disk-1000         6242.53 (  0.00%)     6855.11 (  9.81%)
      disk-1100         6234.75 (  0.00%)     6858.47 ( 10.00%)
      disk-1200         6312.76 (  0.00%)     6845.13 (  8.43%)
      disk-1300         6309.95 (  0.00%)     6834.51 (  8.31%)
      disk-1400         6171.76 (  0.00%)     6787.09 (  9.97%)
      disk-1500         6139.81 (  0.00%)     6761.09 ( 10.12%)
      disk-1600         4807.12 (  0.00%)     6725.33 ( 39.90%)
      disk-1700         4669.50 (  0.00%)     5985.38 ( 28.18%)
      disk-1800         4663.51 (  0.00%)     5972.99 ( 28.08%)
      disk-1900         4674.31 (  0.00%)     5949.94 ( 27.29%)
      disk-2000         4668.36 (  0.00%)     5834.93 ( 24.99%)
      
      In addition, there is a 67.5% increase in successfully migrated NUMA
      pages, thus improving node locality.
      
      The patch layout is simple but designed for bisection (in case reversion
      is needed if the changes break upstream) and easier review:
      
      o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
      o Patches 5-10 share the lock in specific paths, each patch
        details the rationale behind why it should be safe.
      
      This patchset has been tested with: postgres 9.4 (with brand new hugetlb
      support), hugetlbfs test suite (all tests pass, in fact more tests pass
      with these changes than with an upstream kernel), ltp, aim7 benchmarks,
      memcached and iozone with the -B option for mmap'ing.  *Untested* paths
      are nommu, memory-failure, uprobes and xip.
      
      This patch (of 8):
      
      Various parts of the kernel acquire and release this mutex, so add
      i_mmap_lock_write() and i_mmap_unlock_write() helper functions that
      encapsulate this logic.  The next patch will make use of these.
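      
      A hedged sketch of what those helpers amount to at this stage of the
      series (still wrapping the mutex; the rwsem conversion comes in a later
      patch):
      
      static inline void i_mmap_lock_write(struct address_space *mapping)
      {
              mutex_lock(&mapping->i_mmap_mutex);
      }
      
      static inline void i_mmap_unlock_write(struct address_space *mapping)
      {
              mutex_unlock(&mapping->i_mmap_mutex);
      }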
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b28f621