1. 15 December 2014, 2 commits
  2. 14 December 2014, 27 commits
    • aio: Make it possible to remap aio ring · e4a0d3e7
      Committed by Pavel Emelyanov
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump tasks'
      state and restore one back exactly as it was. One of the tasks' state
      bits is the set of aio rings created with the io_setup() call. There are (almost) no problems
      in dumping them; the problem is restoring them -- if I dump a task
      with aio ring originally mapped at address A, I want to restore one
      back at exactly the same address A. Unfortunately, the io_setup() does
      not allow for that -- it mmaps the ring at whatever place mm finds
      appropriate (it calls do_mmap_pgoff() with zero address and without
      the MAP_FIXED flag).
      
      To make restore possible I'm going to mremap() the freshly created ring
      into the address A (under which it was seen before dump). The problem is
      that the ring's virtual address is passed back to the user-space as the
      context ID and this ID is then used as search key by all the other io_foo()
      calls. Reworking this ID to be just some integer doesn't seem to work, as
      this value is already used by libaio as a pointer through which the library
      accesses the memory for aio meta-data.
      
      So, to make restore work we need to make sure that
      
      a) ring is mapped at desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on aio region update the
      kioctx's user_id and mmap_base values.
      
      Here appears the 2nd issue I mentioned in the beginning of this mail.
      If (regardless of the C/R dances I do) someone creates an io context
      with io_setup(), then mremap()-s the ring and then destroys the context,
      the kill_ioctx() routine will call munmap() on the wrong (old) address.
      This will result in a) the aio ring remaining in memory and b) some other
      vma getting unexpectedly unmapped.
      
      What do you think?
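
      For illustration, here is a rough userspace sketch of the restore sequence
      this change enables; restore_ring(), old_addr, ring_size and nr_events are
      hypothetical names standing in for values recorded in the checkpoint image,
      not part of this patch:

      #define _GNU_SOURCE
      #include <linux/aio_abi.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      /* Re-create an aio ring and move it back to the address seen at dump time. */
      static int restore_ring(void *old_addr, size_t ring_size, unsigned int nr_events)
      {
              aio_context_t ctx = 0;

              /* io_setup() maps the ring wherever mm finds appropriate */
              if (syscall(SYS_io_setup, nr_events, &ctx))
                      return -1;

              /*
               * Move the fresh ring to the pre-dump address.  With this patch the
               * kernel also updates kioctx->user_id and ->mmap_base, so old_addr
               * keeps working as the context ID for later io_submit()/io_destroy().
               */
              if (mremap((void *)ctx, ring_size, ring_size,
                         MREMAP_FIXED | MREMAP_MAYMOVE, old_addr) == MAP_FAILED)
                      return -1;

              return 0;
      }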
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      e4a0d3e7
    • percpu: remove __get_cpu_var and __raw_get_cpu_var macros · 6c51ec4d
      Committed by Christoph Lameter
      No user is left in the kernel source tree.  Therefore we can drop the
      definitions.
      
      This is the final merge of the transition away from __get_cpu_var.  After
      this patch the kernel will not build if anyone uses __get_cpu_var.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c51ec4d
    • fsnotify: remove destroy_list from fsnotify_mark · 37d469e7
      Committed by Jan Kara
      destroy_list is used to track marks which still need waiting for srcu
      period end before they can be freed.  However by the time mark is added to
      destroy_list it isn't in group's list of marks anymore and thus we can
      reuse fsnotify_mark->g_list for queueing into destroy_list.  This saves
      two pointers for each fsnotify_mark.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37d469e7
    • fsnotify: unify inode and mount marks handling · 0809ab69
      Committed by Jan Kara
      There's a lot of common code in inode and mount marks handling.  Factor it
      out to a common helper function.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0809ab69
    • ipc/msg: increase MSGMNI, remove scaling · 0050ee05
      Committed by Manfred Spraul
      SysV can be abused to allocate locked kernel memory.  For most systems, a
      small limit doesn't make sense, see the discussion with regards to SHMMAX.
      
      Therefore: increase MSGMNI to the maximum supported.
      
      And: If we ignore the risk of locking too much memory, then an automatic
      scaling of MSGMNI doesn't make sense.  Therefore the logic can be removed.
      
      The code preserves auto_msgmni to avoid breaking any user space applications
      that expect that the value exists.
      
      Notes:
      1) If an administrator must limit the memory allocations, then he can set
      MSGMNI as necessary.
      
      Or he can disable sysv entirely (as e.g. done by Android).
      
      2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
      to control latency vs. throughput:
      If MSGMNB is large, then msgsnd() just returns and more messages can be queued
      before a task switch to a task that calls msgrcv() is forced.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0050ee05
    • ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM · e843e7d2
      Committed by Manfred Spraul
      a)
      
      SysV can be abused to allocate locked kernel memory.  For most systems, a
      small limit doesn't make sense, see the discussion with regards to SHMMAX.
      
      Therefore: Increase the sysv sem limits so that all known applications
      will work with these defaults.
      
      b)
      
      With regards to the maximum supported:
      Some of the specified hard limits are not correct anymore, therefore the
      patch updates the documentation.
      
      - SEMMNI must stay below IPCMNI, which is 32768.
        As for SHMMAX: Stay a bit below this limit.
      
      - SEMMSL was limited to 8k, to ensure that the kmalloc for the kernel array
        was limited to 16 kB (order=2)
      
        This doesn't apply anymore:
         - the allocation size isn't sizeof(short)*nsems anymore.
         - ipc_alloc falls back to vmalloc
      
      - SEMOPM should stay below 1000, to limit the kmalloc in semtimedop() to an
        order=1 allocation.
        Therefore: Leave it at 500 (order=0 allocation).
      
      Note:
      If an administrator must limit the memory allocations, then he can set the
      values as necessary.
      
      Or he can disable sysv entirely (as e.g. done by Android).
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e843e7d2
    • fault-inject: add ratelimit option · 6adc4a22
      Committed by Dmitry Monakhov
      The current debug levels are not optimal.  In particular, if one wants to
      provoke large numbers of faults (e.g. to simulate a broken device), any
      verbose level will produce huge numbers of identical logging messages.
      Let's add a ratelimit parameter for that purpose.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Acked-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6adc4a22
    • ratelimit: add initialization macro · 89e3f909
      Committed by Dmitry Monakhov
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e3f909
    • syscalls: implement execveat() system call · 51f39a1f
      Committed by David Drysdale
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
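
      As an illustration (not part of this patch), a /proc-free fexecve() could be
      sketched on top of the new syscall roughly like this, assuming __NR_execveat
      is wired up for the architecture; fexecve_noproc() is a made-up name:

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/syscall.h>

      static int fexecve_noproc(int fd, char *const argv[], char *const envp[])
      {
              /* empty path + AT_EMPTY_PATH: execute the file the fd refers to */
              return syscall(__NR_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
      }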
      
      Based on patches by Meredydd Luff.
      Signed-off-by: David Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
    • memcg: fix possible use-after-free in memcg_kmem_get_cache() · 8135be5a
      Committed by Vladimir Davydov
      Suppose task @t that belongs to a memory cgroup @memcg is going to
      allocate an object from a kmem cache @c.  The copy of @c corresponding to
      @memcg, @mc, is empty.  Then if kmem_cache_alloc races with the memory
      cgroup destruction we can access the memory cgroup's copy of the cache
      after it was destroyed:
      
      CPU0				CPU1
      ----				----
      [ current=@t
        @mc->memcg_params->nr_pages=0 ]
      
      kmem_cache_alloc(@c):
        call memcg_kmem_get_cache(@c);
        proceed to allocation from @mc:
          alloc a page for @mc:
            ...
      
      				move @t from @memcg
      				destroy @memcg:
      				  mem_cgroup_css_offline(@memcg):
      				    memcg_unregister_all_caches(@memcg):
      				      kmem_cache_destroy(@mc)
      
          add page to @mc
      
      We could fix this issue by taking a reference to a per-memcg cache, but
      that would require adding a per-cpu reference counter to per-memcg caches,
      which would look cumbersome.
      
      Instead, let's take a reference to a memory cgroup, which already has a
      per-cpu reference counter, in the beginning of kmem_cache_alloc to be
      dropped in the end, and move per memcg caches destruction from css offline
      to css free.  As a side effect, per-memcg caches will be destroyed not one
      by one, but all at once when the last page accounted to the memory cgroup
      is freed.  This doesn't sound like too high a price to pay for code readability, though.
      
      Note, this patch does add some overhead to the kmem_cache_alloc hot path,
      but it is pretty negligible - it's just a function call plus a per cpu
      counter decrement, which is comparable to what we already have in
      memcg_kmem_get_cache.  Besides, it's only relevant if there are memory
      cgroups with kmem accounting enabled.  I don't think we can find a way to
      handle this race w/o it, because alloc_page called from kmem_cache_alloc
      may sleep so we can't flush all pending kmallocs w/o reference counting.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8135be5a
    • oom: don't assume that a coredumping thread will exit soon · d003f371
      Committed by Oleg Nesterov
      oom_kill.c assumes that PF_EXITING task should exit and free the memory
      soon.  This is wrong in many ways and one important case is the coredump.
      A task can sleep in exit_mm() "forever" while the coredumping sub-thread
      can need more memory.
      
      Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account;
      we add a new trivial helper for that.
      
      Note: this is only the first step, this patch doesn't try to solve other
      problems.  The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
      participate in coredump after it was already observed in PF_EXITING state,
      so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
      fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
      out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
      And even the name/usage of the new helper is confusing: an exiting thread
      can only free its ->mm if it is the only/last task in the thread group.
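
      For reference, the new helper is roughly of the following shape (a sketch of
      the idea rather than the verbatim patch; see include/linux/oom.h for the
      actual code):

      /* An exiting task is expected to free its memory soon -- unless it is
       * actually participating in a coredump. */
      static inline bool task_will_free_mem(struct task_struct *task)
      {
              return (task->flags & PF_EXITING) &&
                      !(task->signal->flags & SIGNAL_GROUP_COREDUMP);
      }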
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d003f371
    • mm: vmscan: invoke slab shrinkers from shrink_zone() · 6b4f7799
      Committed by Johannes Weiner
      The slab shrinkers are currently invoked from the zonelist walkers in
      kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
      eligible LRU pages and assemble a nodemask to pass to NUMA-aware
      shrinkers, which then again have to walk over the nodemask.  This is
      redundant code, extra runtime work, and fairly inaccurate when it comes to
      the estimation of actually scannable LRU pages.  The code duplication will
      only get worse when making the shrinkers cgroup-aware and requiring them
      to have out-of-band cgroup hierarchy walks as well.
      
      Instead, invoke the shrinkers from shrink_zone(), which is where all
      reclaimers end up, to avoid this duplication.
      
      Take the count for eligible LRU pages out of get_scan_count(), which
      considers many more factors than just the availability of swap space, like
      zone_reclaimable_pages() currently does.  Accumulate the number over all
      visited lruvecs to get the per-zone value.
      
      Some nodes have multiple zones due to memory addressing restrictions.  To
      avoid putting too much pressure on the shrinkers, only invoke them once
      for each such node, using the class zone of the allocation as the pivot
      zone.
      
      For now, this integrates the slab shrinking better into the reclaim logic
      and gets rid of duplicative invocations from kswapd, direct reclaim, and
      zone reclaim.  It also prepares for cgroup-awareness, allowing
      memcg-capable shrinkers to be added at the lruvec level without much
      duplication of both code and runtime work.
      
      This changes kswapd behavior, which used to invoke the shrinkers for each
      zone, but with scan ratios gathered from the entire node, resulting in
      meaningless pressure quantities on multi-zone nodes.
      
      Zone reclaim behavior also changes.  It used to shrink slabs until the
      same amount of pages were shrunk as were reclaimed from the LRUs.  Now it
      merely invokes the shrinkers once with the zone's scan ratio, which makes
      the shrinkers go easier on caches that implement aging and would prefer
      feeding back pressure from recently used slab objects to unused LRU pages.
      
      [vdavydov@parallels.com: assure class zone is populated]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b4f7799
    • mm,vmacache: count number of system-wide flushes · f5f302e2
      Committed by Davidlohr Bueso
      These flushes deal with sequence number overflows, such as for long lived
      threads.  These are rare, but interesting from a debugging PoV.  As such,
      display the number of flushes when vmacache debugging is enabled.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5f302e2
    • mm/page_owner: keep track of page owners · 48c96a36
      Committed by Joonsoo Kim
      This is the page owner tracking code, which was introduced quite a while
      ago.  It has been sitting in Andrew's tree, but nobody tried to upstream
      it, so it remained as is.  Our company actively uses this feature to debug
      memory leaks and to find memory hoggers, so I decided to upstream it.
      
      This functionality helps us find out who allocated a page.  When a page is
      allocated, we store some information about the allocation in extra memory.
      Later, if we need to know the status of all pages, we can obtain and
      analyze it from this stored information.
      
      In the previous version of this feature, the extra memory was statically
      defined in struct page, but in this version the extra memory is allocated
      outside of struct page.  This enables us to turn the feature on/off at
      boot time without considerable memory waste.
      
      Although we already have tracepoints for tracing page allocation/free,
      using them to analyze page ownership is rather complex.  We would need to
      enlarge the trace buffer to prevent it from wrapping before the userspace
      program is launched, and the launched program would have to continually
      dump out the trace buffer for later analysis; that is more likely to
      change system behaviour than just keeping the data in memory, so it is bad
      for debugging.
      
      Moreover, the page_owner feature can be used further for various purposes.
      For example, it is used for the fragmentation statistics implemented in
      this patch, and I also plan to implement a CMA failure debugging feature
      using this interface.
      
      I'd like to give credit to all the developers who contributed to this
      feature, but it's not easy because I don't know the exact history.  Sorry
      about that.  Below are the people who have "Signed-off-by" in the patches
      in Andrew's tree.
      
      Contributor:
      Alexander Nyberg <alexn@dsv.su.se>
      Mel Gorman <mgorman@suse.de>
      Dave Hansen <dave@linux.vnet.ibm.com>
      Minchan Kim <minchan@kernel.org>
      Michal Nazarewicz <mina86@mina86.com>
      Andrew Morton <akpm@linux-foundation.org>
      Jungsoo Son <jungsoo.son@lge.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48c96a36
    • stacktrace: introduce snprint_stack_trace for buffer output · 9a92a6ce
      Committed by Joonsoo Kim
      The current stacktrace code only has a function for console output.
      page_owner, which will be introduced in a following patch, needs to print
      the stacktrace output into a buffer in its own output format, so a new
      function, snprint_stack_trace(), is needed.
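
      A minimal usage sketch of the intended interface (the exact prototype and
      the dump_owner_stack() caller here are illustrative assumptions):

      #include <linux/kernel.h>
      #include <linux/stacktrace.h>

      static void dump_owner_stack(char *buf, size_t size)
      {
              unsigned long entries[16];
              struct stack_trace trace = {
                      .nr_entries  = 0,
                      .max_entries = ARRAY_SIZE(entries),
                      .entries     = entries,
                      .skip        = 0,
              };

              save_stack_trace(&trace);
              /* like print_stack_trace(), but into a buffer instead of the console */
              snprint_stack_trace(buf, size, &trace, 0);
      }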
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a92a6ce
    • mm/debug-pagealloc: make debug-pagealloc boottime configurable · 031bc574
      Committed by Joonsoo Kim
      Now we have prepared everything needed to avoid using debug-pagealloc at
      boot time.  So introduce a new kernel parameter to disable debug-pagealloc
      at boot time, and make the related functions be disabled in that case.
      
      The only non-intuitive part is the change to the guard page functions.
      Because guard pages are effective only if debug-pagealloc is enabled,
      turning them off together with debug-pagealloc is the reasonable thing to
      do.
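
      A sketch of what the boot-time switch looks like (the parameter name and
      handler shown here are assumptions for illustration; see the patch and
      Documentation/kernel-parameters.txt for the authoritative version):

      #include <linux/init.h>
      #include <linux/string.h>

      static bool _debug_pagealloc_enabled __read_mostly;

      /* "debug_pagealloc=on" turns the feature on; anything else leaves it off */
      static int __init early_debug_pagealloc(char *buf)
      {
              if (buf && !strcmp(buf, "on"))
                      _debug_pagealloc_enabled = true;
              return 0;
      }
      early_param("debug_pagealloc", early_debug_pagealloc);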
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      031bc574
    • mm/debug-pagealloc: prepare boottime configurable on/off · e30825f1
      Committed by Joonsoo Kim
      Until now, debug-pagealloc has needed extra flags in struct page, so we had
      to recompile the whole source code whenever we decided to use it.  This is
      really painful, because recompiling takes some time and sometimes a rebuild
      is not possible because a third-party module depends on struct page.  So we
      couldn't use this good feature in many cases.
      
      Now we have the page extension feature, which allows us to keep extra flags
      outside of struct page.  This gets rid of the third-party module issue
      mentioned above, and it also lets us decide at boot time whether we need
      extra memory for the page extension.  With these properties, we can avoid
      using debug-pagealloc at boot time with low computational overhead in a
      kernel built with CONFIG_DEBUG_PAGEALLOC.  This will help our development
      process greatly.
      
      This patch is the preparation step to achieve the above goal.
      debug-pagealloc originally used an extra field of struct page; after this
      patch it will use a field of struct page_ext.  Because memory for page_ext
      is allocated later than the initialization of the page allocator with
      CONFIG_SPARSEMEM, we should disable the debug-pagealloc feature temporarily
      until page_ext is initialized.  This patch implements that.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e30825f1
    • mm/page_ext: resurrect struct page extending code for debugging · eefa864b
      Committed by Joonsoo Kim
      When we debug something, we'd like to attach some information to every
      page.  For this purpose, we sometimes modify struct page itself, but this
      has drawbacks.  First, it requires a recompile, which makes us hesitate to
      use the powerful debug feature, so the development process is slowed down.
      Second, sometimes it is impossible to rebuild the kernel due to a
      third-party module dependency.  Third, system behaviour would be largely
      different after a recompile, because it changes the size of struct page
      greatly and this structure is accessed by every part of the kernel.
      Keeping things as they are would be better for reproducing an erroneous
      situation.
      
      This feature is intended to overcome above mentioned problems.  This
      feature allocates memory for extended data per page in certain place
      rather than the struct page itself.  This memory can be accessed by the
      accessor functions provided by this code.  During the boot process, it
      checks whether allocation of a huge chunk of memory is needed or not.  If
      not, it avoids allocating memory at all.  With this advantage, we can
      include this feature in the kernel by default and can avoid rebuilds and
      solve the related problems.
      
      Until now, memcg used this technique.  But memcg has now decided to embed
      its variable in struct page itself, and its code to extend struct page has
      been removed.  I'd like to use this code to develop a debug feature, so
      this patch resurrects it.
      
      To make these things work well, this patch introduces two callbacks for
      clients.  One is the need callback, which is mandatory if the user wants
      to avoid useless memory allocation at boot time.  The other, the optional
      init callback, is used to do proper initialization after memory is
      allocated.  A detailed explanation of the purpose of these functions is in
      a code comment; please refer to it.
      
      Everything else is the same as the previous extension code in memcg.
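
      A rough sketch of the client interface described above (names assumed from
      this description; the authoritative definition lives in
      include/linux/page_ext.h):

      struct page_ext_operations {
              bool (*need)(void);     /* mandatory: is the extra memory needed at all? */
              void (*init)(void);     /* optional: initialize it once it is allocated */
      };

      /* hypothetical client */
      static bool my_debug_need(void)
      {
              return true;            /* e.g. depending on a boot parameter */
      }

      static struct page_ext_operations my_debug_ops = {
              .need = my_debug_need,
              /* .init omitted: it is optional */
      };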
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eefa864b
    • mm, gfp: escalatedly define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE · 2d48366b
      Committed by Jianyu Zhan
      GFP_USER, GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE are defined in an
      escalating fashion, as also implied by their names:
      
      GFP_USER                                  = GFP_USER
      GFP_USER + __GFP_HIGHMEM                  = GFP_HIGHUSER
      GFP_USER + __GFP_HIGHMEM + __GFP_MOVABLE  = GFP_HIGHUSER_MOVABLE
      
      So just define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE in terms of each other
      to reflect this fact.  It also makes the definitions clearer and will
      textually warn about any future break-up of this escalating relationship.
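
      The resulting definitions then look roughly like this (sketch):

      #define GFP_HIGHUSER            (GFP_USER | __GFP_HIGHMEM)
      #define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)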
      Signed-off-by: Jianyu Zhan <jianyu.zhan@emc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d48366b
    • include/linux/kmemleak.h: needs slab.h · 66f2ca7e
      Committed by Andrew Morton
      include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive':
      include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function)
      
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66f2ca7e
    • mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache() · 056b7cce
      Committed by Zhang Zhen
      The gfp was passed in but never used in this function.
      Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      056b7cce
    • mm: move swp_entry_t definition to include/linux/mm_types.h · bd6dace7
      Committed by Tejun Heo
      swp_entry_t being defined in include/linux/swap.h instead of
      include/linux/mm_types.h causes cyclic include dependency later when
      include/linux/page_cgroup.h is included from writeback path.  Move the
      definition to include/linux/mm_types.h.
      
      While at it, reformat the comment above it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd6dace7
    • memcg: turn memcg_kmem_skip_account into a bit field · 6f185c29
      Committed by Vladimir Davydov
      It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on
      the task_struct.
      
      Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to
      set/clear the flag inline.  Regarding the overwhelming comment to the
      helpers, which is removed by this patch too, we already have a compact yet
      accurate explanation in memcg_schedule_cache_create, no need in yet
      another one.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f185c29
    • lib: bitmap: add alignment offset for bitmap_find_next_zero_area() · 5e19b013
      Committed by Michal Nazarewicz
      Add a bitmap_find_next_zero_area_off() function which works like
      bitmap_find_next_zero_area() function except it allows an offset to be
      specified when alignment is checked.  This lets the caller request a bit
      such that its number plus the offset is aligned according to the mask.
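
      A sketch of the resulting interface (prototypes assumed from this
      description; see lib/bitmap.c and include/linux/bitmap.h for the real ones):

      unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
                                                   unsigned long size,
                                                   unsigned long start,
                                                   unsigned int nr,
                                                   unsigned long align_mask,
                                                   unsigned long align_offset);

      /* the existing helper becomes the align_offset == 0 case */
      static inline unsigned long
      bitmap_find_next_zero_area(unsigned long *map, unsigned long size,
                                 unsigned long start, unsigned int nr,
                                 unsigned long align_mask)
      {
              return bitmap_find_next_zero_area_off(map, size, start, nr,
                                                    align_mask, 0);
      }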
      
      [gregory.0xf0@gmail.com: Retrieved from https://patchwork.linuxtv.org/patch/6254/ and updated documentation]
      Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kukjin Kim <kgene.kim@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e19b013
    • mm/rmap: share the i_mmap_rwsem · 3dec0ba0
      Committed by Davidlohr Bueso
      Similarly to the anon memory counterpart, we can share the mapping's lock
      ownership as the interval tree is not modified when doing the walk, only
      the file page.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dec0ba0
    • mm: convert i_mmap_mutex to rwsem · c8c06efa
      Committed by Davidlohr Bueso
      The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
      similar data, one for file backed pages and the other for anon memory.  To
      this end, this lock can also be a rwsem.  In addition, there are some
      important opportunities to share the lock when there are no tree
      modifications.
      
      This conversion is straightforward.  For now, all users take the write
      lock.
      
      [sfr@canb.auug.org.au: update fremap.c]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8c06efa
    • mm,fs: introduce helpers around the i_mmap_mutex · 8b28f621
      Committed by Davidlohr Bueso
      This series is a continuation of the conversion of the i_mmap_mutex to
      rwsem, following what we have for the anon memory counterpart.  With
      Hugh's feedback from the first iteration.
      
      Ultimately, the most obvious paths that require exclusive ownership of the
      lock is when we modify the VMA interval tree, via
      vma_interval_tree_insert() and vma_interval_tree_remove() families.  Cases
      such as unmapping, where the ptes content is changed but the tree remains
      untouched should make it safe to share the i_mmap_rwsem.
      
      As such, the code of course is straightforward; however, the devil is very
      much in the details.  While it's been tested on a number of workloads
      without anything exploding, I would not be surprised if there are some
      less documented/known assumptions about the lock that could suffer from
      these changes.  Or maybe I'm just missing something, but either way I
      believe it's at the point where it could use more eyes and hopefully some
      time in linux-next.
      
      Because the lock type conversion is the heart of this patchset,
      its worth noting a few comparisons between mutex vs rwsem (xadd):
      
        (i) Same size, no extra footprint.
      
        (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
             exclusive lock ownership.
      
        (iii) Both can be slightly unfair wrt exclusive ownership, with
              writer lock stealing properties, not necessarily respecting
              FIFO order for granting the lock when contended.
      
        (iv) Mutexes can be slightly faster than rwsems when
             the lock is non-contended.
      
        (v) Both suck at performance for debug (slowpaths), which
            shouldn't matter anyway.
      
      Sharing the lock is obviously beneficial, and sem writer ownership is
      close enough to mutexes.  The biggest winner of these changes is
      migration.
      
      As for concrete numbers, the following performance results are for a
      4-socket 60-core IvyBridge-EX with 130Gb of RAM.
      
      Both alltests and disk (xfs+ramdisk) workloads of aim7 suite do quite well
      with this set, with a steady ~60% throughput (jpm) increase for alltests
      and up to ~30% for disk for high amounts of concurrency.  Lower counts of
      workload users (< 100) does not show much difference at all, so at least
      no regressions.
      
                          3.18-rc1            3.18-rc1-i_mmap_rwsem
      alltests-100     17918.72 (  0.00%)    28417.97 ( 58.59%)
      alltests-200     16529.39 (  0.00%)    26807.92 ( 62.18%)
      alltests-300     16591.17 (  0.00%)    26878.08 ( 62.00%)
      alltests-400     16490.37 (  0.00%)    26664.63 ( 61.70%)
      alltests-500     16593.17 (  0.00%)    26433.72 ( 59.30%)
      alltests-600     16508.56 (  0.00%)    26409.20 ( 59.97%)
      alltests-700     16508.19 (  0.00%)    26298.58 ( 59.31%)
      alltests-800     16437.58 (  0.00%)    26433.02 ( 60.81%)
      alltests-900     16418.35 (  0.00%)    26241.61 ( 59.83%)
      alltests-1000    16369.00 (  0.00%)    26195.76 ( 60.03%)
      alltests-1100    16330.11 (  0.00%)    26133.46 ( 60.03%)
      alltests-1200    16341.30 (  0.00%)    26084.03 ( 59.62%)
      alltests-1300    16304.75 (  0.00%)    26024.74 ( 59.61%)
      alltests-1400    16231.08 (  0.00%)    25952.35 ( 59.89%)
      alltests-1500    16168.06 (  0.00%)    25850.58 ( 59.89%)
      alltests-1600    16142.56 (  0.00%)    25767.42 ( 59.62%)
      alltests-1700    16118.91 (  0.00%)    25689.58 ( 59.38%)
      alltests-1800    16068.06 (  0.00%)    25599.71 ( 59.32%)
      alltests-1900    16046.94 (  0.00%)    25525.92 ( 59.07%)
      alltests-2000    16007.26 (  0.00%)    25513.07 ( 59.38%)
      
      disk-100          7582.14 (  0.00%)     7257.48 ( -4.28%)
      disk-200          6962.44 (  0.00%)     7109.15 (  2.11%)
      disk-300          6435.93 (  0.00%)     6904.75 (  7.28%)
      disk-400          6370.84 (  0.00%)     6861.26 (  7.70%)
      disk-500          6353.42 (  0.00%)     6846.71 (  7.76%)
      disk-600          6368.82 (  0.00%)     6806.75 (  6.88%)
      disk-700          6331.37 (  0.00%)     6796.01 (  7.34%)
      disk-800          6324.22 (  0.00%)     6788.00 (  7.33%)
      disk-900          6253.52 (  0.00%)     6750.43 (  7.95%)
      disk-1000         6242.53 (  0.00%)     6855.11 (  9.81%)
      disk-1100         6234.75 (  0.00%)     6858.47 ( 10.00%)
      disk-1200         6312.76 (  0.00%)     6845.13 (  8.43%)
      disk-1300         6309.95 (  0.00%)     6834.51 (  8.31%)
      disk-1400         6171.76 (  0.00%)     6787.09 (  9.97%)
      disk-1500         6139.81 (  0.00%)     6761.09 ( 10.12%)
      disk-1600         4807.12 (  0.00%)     6725.33 ( 39.90%)
      disk-1700         4669.50 (  0.00%)     5985.38 ( 28.18%)
      disk-1800         4663.51 (  0.00%)     5972.99 ( 28.08%)
      disk-1900         4674.31 (  0.00%)     5949.94 ( 27.29%)
      disk-2000         4668.36 (  0.00%)     5834.93 ( 24.99%)
      
      In addition, a 67.5% increase in successfully migrated NUMA pages, thus
      improving node locality.
      
      The patch layout is simple but designed for bisection (in case reversion
      is needed if the changes break upstream) and easier review:
      
      o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
      o Patches 5-10 share the lock in specific paths, each patch
        details the rationale behind why it should be safe.
      
      This patchset has been tested with: postgres 9.4 (with brand new hugetlb
      support), hugetlbfs test suite (all tests pass, in fact more tests pass
      with these changes than with an upstream kernel), ltp, aim7 benchmarks,
      memcached and iozone with the -B option for mmap'ing.  *Untested* paths
      are nommu, memory-failure, uprobes and xip.
      
      This patch (of 8):
      
      Various parts of the kernel acquire and release this mutex, so add
      i_mmap_lock_write() and i_mmap_unlock_write() helper functions that will
      encapsulate this logic.  The next patch will make use of these.
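
      A sketch of what such helpers look like at this stage of the series (they
      still wrap the mutex here; the rwsem conversion happens in a later patch):

      static inline void i_mmap_lock_write(struct address_space *mapping)
      {
              mutex_lock(&mapping->i_mmap_mutex);
      }

      static inline void i_mmap_unlock_write(struct address_space *mapping)
      {
              mutex_unlock(&mapping->i_mmap_mutex);
      }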
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b28f621
  3. 13 December 2014, 2 commits
    • tracing/sched: Check preempt_count() for current when reading task->state · aee4e5f3
      Committed by Steven Rostedt (Red Hat)
      When recording the state of a task for the sched_switch tracepoint a check of
      task_preempt_count() is performed to see if PREEMPT_ACTIVE is set. This is
      because, technically, a task being preempted is really in the TASK_RUNNING
      state, and that is what should be recorded when tracing a sched_switch,
      even if the task put itself into another state (it hasn't scheduled out
      in that state yet).
      
      But with the change to use per_cpu preempt counts, the
      task_thread_info(p)->preempt_count is no longer used, and instead
      task_preempt_count(p) is used.
      
      The problem is that this does not use the current preempt count but a stale
      one from a previous sched_switch. The task_preempt_count(p) uses
      saved_preempt_count and not preempt_count(). But for tracing sched_switch,
      if p is current, we really want preempt_count().
      
      I hit this bug when I was tracing sleep and the call from do_nanosleep()
      scheduled out in the "RUNNING" state.
      
                 sleep-4290  [000] 537272.259992: sched_switch:         sleep:4290 [120] R ==> swapper/0:0 [120]
                 sleep-4290  [000] 537272.260015: kernel_stack:         <stack trace>
      => __schedule (ffffffff8150864a)
      => schedule (ffffffff815089f8)
      => do_nanosleep (ffffffff8150b76c)
      => hrtimer_nanosleep (ffffffff8108d66b)
      => SyS_nanosleep (ffffffff8108d750)
      => return_to_handler (ffffffff8150e8e5)
      => tracesys_phase2 (ffffffff8150c844)
      
      After a bit of hair pulling, I found that the state was really
      TASK_INTERRUPTIBLE, but the saved_preempt_count had an old PREEMPT_ACTIVE
      set and caused the sched_switch tracepoint to show it as RUNNING.
      
      Link: http://lkml.kernel.org/r/20141210174428.3cb7542a@gandalf.local.home
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org # 3.13+
      Cc: Peter Zijlstra <peterz@infradead.org>
      Fixes: 01028747 "sched: Create more preempt_count accessors"
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      aee4e5f3
  4. 12 December 2014, 9 commits
    • arch: Add lightweight memory barriers dma_rmb() and dma_wmb() · 1077fa36
      Committed by Alexander Duyck
      There are a number of situations where the mandatory barriers rmb() and
      wmb() are used to order memory/memory operations in the device drivers
      and those barriers are much heavier than they actually need to be.  For
      example in the case of PowerPC wmb() calls the heavy-weight sync
      instruction when for coherent memory operations all that is really needed
      is an lsync or eieio instruction.
      
      This commit adds a coherent only version of the mandatory memory barriers
      rmb() and wmb().  In most cases this should result in the barrier being the
      same as the SMP barriers for the SMP case, however in some cases we use a
      barrier that is somewhere in between rmb() and smp_rmb().  For example on
      ARM the rmb barriers break down as follows:
      
        Barrier   Call     Explanation
        --------- -------- ----------------------------------
        rmb()     dsb()    Data synchronization barrier - system
        dma_rmb() dmb(osh) data memory barrier - outer sharable
        smp_rmb() dmb(ish) data memory barrier - inner sharable
      
      These new barriers are not as safe as the standard rmb() and wmb().
      Specifically they do not guarantee ordering between coherent and incoherent
      memories.  The primary use case for these would be to enforce ordering of
      reads and writes when accessing coherent memory that is shared between the
      CPU and a device.
      
      It may also be noted that there is no dma_mb().  Most architectures don't
      provide a good mechanism for performing a coherent only full barrier without
      resorting to the same mechanism used in mb().  As such there isn't much to
      be gained in trying to define such a function.
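
      As an illustration of the intended use (the descriptor layout, the
      RX_DESC_DONE flag and poll_rx() are made up for this example; only
      dma_rmb() comes from this patch):

      #include <linux/types.h>

      #define RX_DESC_DONE    0x1     /* made-up "descriptor complete" flag */

      struct rx_desc {
              u32 status;
              u32 len;
      };

      /* desc lives in coherent DMA memory shared with the device */
      static u32 poll_rx(struct rx_desc *desc)
      {
              if (!(desc->status & RX_DESC_DONE))
                      return 0;

              /*
               * Order the status check before reading the rest of the descriptor.
               * A full rmb() would also work but is heavier than necessary here.
               */
              dma_rmb();

              return desc->len;
      }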
      
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Michael Ellerman <michael@ellerman.id.au>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1077fa36
    • pstore-ram: Allow optional mapping with pgprot_noncached · 027bc8b0
      Committed by Tony Lindgren
      On some ARMs the memory can be mapped pgprot_noncached() and still
      be working for atomic operations. As pointed out by Colin Cross
      <ccross@android.com>, in some cases you do want to use
      pgprot_noncached() if the SoC supports it to see a debug printk
      just before a write hanging the system.
      
      On ARMs, the atomic operations on strongly ordered memory are
      implementation defined. So let's provide an optional kernel parameter
      for configuring pgprot_noncached(), and use pgprot_writecombine() by
      default.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rob Herring <robherring2@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: stable@vger.kernel.org
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      027bc8b0
    • i2c: core changes for slave support · 4b1acc43
      Committed by Wolfram Sang
      Finally(!), make Linux support being an I2C slave. Most of the existing
      infrastructure is reused. We mainly add i2c_slave_register/unregister()
      calls which tell i2c bus drivers to activate slave mode. The bus drivers
      then also get a callback to which they report slave events.
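
      A rough sketch of how a slave backend hooks in (the callback body and
      my_backend_attach() are illustrative; the exact callback type and event
      values are defined by this patch's i2c.h changes):

      #include <linux/i2c.h>

      static int my_slave_cb(struct i2c_client *client,
                             enum i2c_slave_event event, u8 *val)
      {
              /*
               * Dispatch on 'event': fill *val when the remote master reads from
               * us, consume *val when it writes to us.
               */
              return 0;
      }

      static int my_backend_attach(struct i2c_client *client)
      {
              /* tells the bus driver to activate slave mode for this client */
              return i2c_slave_register(client, my_slave_cb);
      }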
      Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
      Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
      4b1acc43
    • ipmi: Remove the now unused priority from SMI sender · 99ab32f3
      Committed by Corey Minyard
      Since the queue was moved into the message handler, the priority
      field is now irrelevant.
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      99ab32f3
    • ipmi: Use the proper type for acpi_handle · a11213fc
      Committed by Corey Minyard
      Minor cleanup, don't use a void pointer, use the right type.
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      a11213fc
    • ipmi: Remove useless sysfs_name parameters · 5a0e10ec
      Committed by Corey Minyard
      It was always "bmc", so just hardcode it.  It makes no sense to
      pass that in.
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      5a0e10ec
    • ipmi: Move the address source to string to ipmi-generic code · 7e50387b
      Committed by Corey Minyard
      It was in the system interface driver, but is generic functionality.
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      7e50387b
    • net/mlx4: Add support for A0 steering · 7d077cd3
      Committed by Matan Barak
      Add the required firmware commands for A0 steering and a way to enable
      that. The firmware support focuses on INIT_HCA, QUERY_HCA, QUERY_PORT,
      QUERY_DEV_CAP and QUERY_FUNC_CAP commands. Those commands are used
      to configure and query the device.
      
      The different A0 DMFS (steering) modes are:
      
      Static - optimized performance, but flow steering rules are
      limited. This mode should be chosen explicitly by the user
      in order to be used.
      
      Dynamic - this mode should be explicitly chosen by the user.
      In this mode, the FW works in optimized steering mode as long as
      it can and afterwards automatically drops to classic (full) DMFS.
      
      Disable - this mode should be explicitly chosen by the user.
      The user instructs the system not to use optimized steering, even if
      the FW supports Dynamic A0 DMFS (and thus will be able to use optimized
      steering in Default A0 DMFS mode).
      
      Default - this mode is implicitly chosen. In this mode, if the FW
      supports Dynamic A0 DMFS, it'll work in this mode. Otherwise, it'll
      work at Disable A0 DMFS mode.
      
      Under SRIOV configuration, when the A0 steering mode is enabled,
      older guest VF drivers who aren't using the RX QP allocation flag
      (MLX4_RESERVE_A0_QP) will get a QP from the general range and
      fail when attempting to register a steering rule. To avoid that,
      the PF context behaviour is changed once on A0 static mode, to
      require support for the allocation flag in VF drivers too.
      
      In order to enable A0 steering, we use log_num_mgm_entry_size param.
      If the value of the parameter is not positive, we treat the absolute
      value of log_num_mgm_entry_size as a bit field. Setting bit 2 of this
      bit field enables static A0 steering.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d077cd3
    • net/mlx4: Add A0 hybrid steering · d57febe1
      Committed by Matan Barak
      A0 hybrid steering is a form of high performance flow steering.
      By using this mode, mlx4 cards use a fast limited table based steering,
      in order to enable fast steering of unicast packets to a QP.
      
      In order to implement A0 hybrid steering we allocate resources
      from different zones:
      (1) General range
      (2) Special MAC-assigned QPs [RSS, Raw-Ethernet] each has its own region.
      
      When we create a rss QP or a raw ethernet (A0 steerable and BF ready) QP,
      we try hard to allocate the QP from range (2). Otherwise, we try hard not
      to allocate from this range. However, when the system is pushed to its
      limits and one needs every resource, the allocator uses every region it can.
      
      Meaning, when we run out of raw-eth qps, the allocator allocates from the
      general range (and the special-A0 area is no longer active). If we run out
      of RSS qps, the mechanism tries to allocate from the raw-eth QP zone. If that
      is also exhausted, the allocator will allocate from the general range
      (and the A0 region is no longer active).
      
      Note that if a raw-eth qp is allocated from the general range, it attempts
      to allocate the range such that bits 6 and 7 (blueflame bits) in the
      QP number are not set.
      
      When the feature is used in SRIOV, the VF has to notify the PF what
      kind of QP attributes it needs. In order to do that, along with the
      "Eth QP blueflame" bit, we reserve a new "A0 steerable QP". According
      to the combination of these bits, the PF tries to allocate a suitable QP.
      
      In order to maintain backward compatibility (with older PFs), the PF
      notifies which QP attributes it supports via QUERY_FUNC_CAP command.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d57febe1