1. 19 Dec, 2012 · 40 commits
    • memcg: allocate memory for memcg caches whenever a new memcg appears · 55007d84
      Glauber Costa committed
      Every cache that is considered a root cache (basically the "original"
      caches, tied to the root memcg/no-memcg) will have an array that should be
      large enough to store a cache pointer per each memcg in the system.
      
      Theoretically, this is as high as 1 << sizeof(css_id), which is currently
      in the 64k pointers range.  Most of the time, we won't be using that much.
      
      What goes in this patch is a simple scheme to dynamically allocate such
      an array, in order to minimize memory usage for memcg caches.  Because we
      would also like to avoid allocations all the time, at least for now, the
      array will only grow.  It will tend to be big enough to hold the maximum
      number of kmem-limited memcgs ever achieved.
      
      We'll allocate it to be a minimum of 64 kmem-limited memcgs.  When we have
      more than that, we'll start doubling the size of this array every time the
      limit is reached.
      
      Because we are only considering kmem limited memcgs, a natural point for
      this to happen is when we write to the limit.  At that point, we already
      have set_limit_mutex held, so that will become our natural synchronization
      mechanism.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55007d84
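The growth policy described in 55007d84 (a floor of 64 entries, doubling when the limit is reached, never shrinking) can be sketched in userspace C. This is only an illustration under stated assumptions: `cache_array`, `grown_size`, `cache_array_reserve` and `MEMCG_CACHES_MIN_SIZE` are hypothetical names, and the locking the commit relies on (set_limit_mutex) is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MEMCG_CACHES_MIN_SIZE 64 /* hypothetical name for the 64-entry floor */

/* Userspace stand-in for a root cache's array of per-memcg cache pointers. */
struct cache_array {
    void **slots;
    size_t size; /* number of slots currently allocated */
};

/* Smallest size that can hold index `id`: start at the floor, then double. */
static size_t grown_size(size_t cur, size_t id)
{
    size_t size = cur ? cur : MEMCG_CACHES_MIN_SIZE;

    while (size <= id)
        size *= 2;
    return size;
}

/*
 * Make room for memcg index `id`.  The array only ever grows.  The commit
 * does this with set_limit_mutex held; this sketch does no locking.
 * Returns 0 on success, -1 if allocation fails (array left untouched).
 */
static int cache_array_reserve(struct cache_array *arr, size_t id)
{
    size_t new_size = grown_size(arr->size, id);
    void **slots;

    if (new_size == arr->size)
        return 0;

    slots = calloc(new_size, sizeof(*slots));
    if (!slots)
        return -1;
    if (arr->slots)
        memcpy(slots, arr->slots, arr->size * sizeof(*slots));
    free(arr->slots);
    arr->slots = slots;
    arr->size = new_size;
    return 0;
}
```

Because the array never shrinks, its size settles at the first power-of-two multiple of 64 that covers the largest index ever requested.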
    • slab/slub: consider a memcg parameter in kmem_create_cache · 2633d7a0
      Glauber Costa committed
      Allow a memcg parameter to be passed during cache creation.  When the slub
      allocator is being used, it will only merge caches that belong to the same
      memcg.  We'll do this by scanning the global list, and then translating
      the cache to a memcg-specific cache.
      
      Default function is created as a wrapper, passing NULL to the memcg
      version.  We only merge caches that belong to the same memcg.
      
      A helper, memcg_css_id, is provided because slub needs a unique cache
      name for sysfs.  Since this is visible, but not the canonical location
      for slab data, the cache name is not used; the css_id should suffice.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2633d7a0
    • slab: annotate on-slab caches nodelist locks · 6ccfb5bc
      Glauber Costa committed
      We currently provide lockdep annotation for kmalloc caches, and also
      caches that have SLAB_DEBUG_OBJECTS enabled.  The reason for this is that
      we can quite frequently nest in the l3->list_lock lock, which is not
      something trivial to avoid.
      
      My proposal with this patch is to extend this to caches whose slab
      management object lives within the slab as well ("on_slab").  The need for
      this arose in the context of testing kmemcg-slab patches.  With that
      patchset, we can have per-memcg kmalloc caches.  So the same path that led
      to nesting between kmalloc caches could then lead to in-memcg
      nesting.  Because they are not annotated, lockdep will trigger.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ccfb5bc
    • slab/slub: struct memcg_params · ba6c496e
      Glauber Costa committed
      For the kmem slab controller, we need to record some extra information in
      the kmem_cache structure.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba6c496e
    • memcg: add documentation about the kmem controller · d5bdae7d
      Glauber Costa committed
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5bdae7d
    • fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs · 2ad306b1
      Glauber Costa committed
      Because those architectures will draw their stacks directly from the page
      allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
      flag, and issue the corresponding free_pages.
      
      This code path is taken when the architecture doesn't define
      CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
      THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
      architectures fall in this category.
      
      This will guarantee that every stack page is accounted to the memcg the
      process currently lives in, and will cause allocations to fail if they
      go over the limit.
      
      For the time being, I am defining a new variant of THREADINFO_GFP, not to
      mess with the other path.  Once the slab is also tracked by memcg, we can
      get rid of that flag.
      
      Tested to successfully protect against :(){ :|:& };:
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ad306b1
    • memcg: execute the whole memcg freeing in free_worker() · c8b2a36f
      Glauber Costa committed
      A lot of the initialization we do in mem_cgroup_create() is done with
      softirqs enabled.  This includes grabbing a css id, which holds
      &ss->id_lock->rlock, and the per-zone trees, which holds
      rtpz->lock->rlock.  All of those signal to the lockdep mechanism that
      those locks can be used in SOFTIRQ-ON-W context.
      
      This means that the freeing of memcg structure must happen in a
      compatible context, otherwise we'll get a deadlock, like the one below,
      caught by lockdep:
      
         free_accounted_pages+0x47/0x4c
         free_task+0x31/0x5c
         __put_task_struct+0xc2/0xdb
         put_task_struct+0x1e/0x22
         delayed_put_task_struct+0x7a/0x98
         __rcu_process_callbacks+0x269/0x3df
         rcu_process_callbacks+0x31/0x5b
         __do_softirq+0x122/0x277
      
      This usage pattern could not be triggered before kmem came into play.
      With the introduction of kmem stack handling, it is possible that we call
      the last mem_cgroup_put() from the task destructor, which is run in an rcu
      callback.  Such callbacks are run with softirqs disabled, leading to the
      offensive usage pattern.
      
      In general, we have little, if any, means to guarantee in which context
      the last memcg_put will happen.  The best we can do is test it and try to
      make sure no invalid context releases are happening.  But as we add more
      code to memcg, the possible interactions grow in number and expose more
      ways to get context conflicts.  One thing to keep in mind, is that part of
      the freeing process is already deferred to a worker, such as vfree(), that
      can only be called from process context.
      
      For the moment, the only two functions we really need moved away are:
      
        * free_css_id(), and
        * mem_cgroup_remove_from_trees().
      
      But because the latter accesses per-zone info,
      free_mem_cgroup_per_zone_info() needs to be moved as well.  With that, we
      are left with the per_cpu stats only.  Better move it all.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Tested-by: Greg Thelen <gthelen@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8b2a36f
    • memcg: allow a memcg with kmem charges to be destructed · bea207c8
      Glauber Costa committed
      Because the ultimate goal of the kmem tracking in memcg is to track slab
      pages as well, we can't guarantee that we'll always be able to point a
      page to a particular process, and migrate the charges along with it -
      since in the common case, a page will contain data belonging to multiple
      processes.
      
      Because of that, when we destroy a memcg, we only make sure the
      destruction will succeed by discounting the kmem charges from the user
      charges when we try to empty the cgroup.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bea207c8
    • memcg: use static branches when code not in use · a8964b9b
      Glauber Costa committed
      We can use static branches to patch the code in or out when not used.
      
      Because the _ACTIVE bit on kmem_accounted is only set after the increment
      is done, we guarantee that the root memcg will always be selected for kmem
      charges until all call sites are patched (see memcg_kmem_enabled).  This
      guarantees that no mischarges are applied.
      
      Static branch decrement happens when the last reference count from the
      kmem accounting in memcg dies.  This will only happen when the charges
      drop down to 0.
      
      When that happens, we need to disable the static branch only on those
      memcgs that enabled it.  To achieve this, we would be forced to complicate
      the code by keeping track of which memcgs were the ones that actually
      enabled limits, and which ones got it from its parents.
      
      It is a lot simpler just to do static_key_slow_inc() on every child
      that is accounted.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8964b9b
    • memcg: kmem accounting lifecycle management · 7de37682
      Glauber Costa committed
      Because kmem charges can outlive the cgroup, we need to make sure that we
      won't free the memcg structure while charges are still in flight.  For
      reviewing simplicity, the charge functions will issue mem_cgroup_get() at
      every charge, and mem_cgroup_put() at every uncharge.
      
      This can get expensive, however, and we can do better.  mem_cgroup_get()
      only really needs to be issued once: when the first limit is set.  In the
      same spirit, we only need to issue mem_cgroup_put() when the last charge
      is gone.
      
      We'll need an extra bit in kmem_account_flags for that:
      KMEM_ACCOUNTED_DEAD.  It will be set when the cgroup dies, if there are
      charges in the group.  If there aren't, we can proceed right away.
      
      Our uncharge function will have to test that bit every time the charges
      drop to 0.  Because that is not the likely output of res_counter_uncharge,
      this should not impose a big hit on us: it is certainly much better than a
      reference count decrease at every operation.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7de37682
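The reference-counting scheme from 7de37682 can be modelled roughly as follows. This is a simplified, single-threaded sketch: `toy_memcg` and the `toy_*` functions are hypothetical, `refcnt` stands in for mem_cgroup_get()/mem_cgroup_put(), and `kmem_dead` plays the role of KMEM_ACCOUNTED_DEAD. The real code needs atomic bit operations and the res_counter return value.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_memcg {
    unsigned long long kmem_usage; /* outstanding kmem charges, in bytes */
    int refcnt;                    /* stand-in for mem_cgroup_get()/put() */
    bool kmem_active;              /* a kmem limit has been set */
    bool kmem_dead;                /* models KMEM_ACCOUNTED_DEAD */
};

/* First limit: take the single reference that covers all kmem charges. */
static void toy_set_kmem_limit(struct toy_memcg *m)
{
    if (!m->kmem_active) {
        m->kmem_active = true;
        m->refcnt++;
    }
}

/* Charging no longer touches the reference count at all. */
static void toy_kmem_charge(struct toy_memcg *m, unsigned long long size)
{
    m->kmem_usage += size;
}

/* Cgroup removal: drop the reference now if idle, otherwise mark it dead. */
static void toy_kmem_destroy(struct toy_memcg *m)
{
    if (!m->kmem_active)
        return;
    if (m->kmem_usage == 0)
        m->refcnt--;
    else
        m->kmem_dead = true;
}

/* Only the (unlikely) drop to zero needs to look at the dead bit. */
static void toy_kmem_uncharge(struct toy_memcg *m, unsigned long long size)
{
    m->kmem_usage -= size;
    if (m->kmem_usage == 0 && m->kmem_dead) {
        m->kmem_dead = false;
        m->refcnt--;
    }
}
```

The point of the design is that the common charge and uncharge paths pay for one comparison instead of a reference count update.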
    • res_counter: return amount of charges after res_counter_uncharge() · 50bdd430
      Glauber Costa committed
      It is useful to know how many charges are still left after a call to
      res_counter_uncharge.  While it is possible to issue a res_counter_read
      after uncharge, this can be racy.
      
      If we need, for instance, to take some action when the counters drop down
      to 0, only one of the callers should see it.  This is the same semantics
      as the atomic variables in the kernel.
      
      Since the current return value is void, we don't need to worry about
      anything breaking due to this change: nobody relied on that, and only
      users appearing from now on will be checking this value.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50bdd430
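The race that 50bdd430 avoids is easiest to see with an atomic counter: if the remaining usage comes back from the same atomic step as the decrement, exactly one caller observes zero. The sketch below uses C11 atomics purely as an analogy. The kernel's res_counter is protected by a lock rather than built on `stdatomic.h`, and `toy_counter`, `toy_charge` and `toy_uncharge` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy counter, used only to illustrate the return-value semantics. */
struct toy_counter {
    atomic_ullong usage;
};

static void toy_charge(struct toy_counter *c, unsigned long long val)
{
    atomic_fetch_add(&c->usage, val);
}

/*
 * Uncharge and return the usage that is left, computed in the same atomic
 * step as the decrement.  Callers must not uncharge more than they charged.
 */
static unsigned long long toy_uncharge(struct toy_counter *c,
                                       unsigned long long val)
{
    /* atomic_fetch_sub returns the old value, so subtract val once more. */
    return atomic_fetch_sub(&c->usage, val) - val;
}
```

A separate read after the uncharge could let two concurrent callers both see zero, or neither; returning the value from the uncharge itself removes that window.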
    • mm: allocate kernel pages to the right memcg · 6a1a0d3b
      Glauber Costa committed
      When a process tries to allocate a page with the __GFP_KMEMCG flag, the
      page allocator will call the corresponding memcg functions to validate
      the allocation.  Tasks in the root memcg can always proceed.
      
      To avoid adding markers to the page - and a kmem flag that would
      necessarily follow, as much as doing page_cgroup lookups for no reason,
      whoever is marking its allocations with __GFP_KMEMCG flag is responsible
      for telling the page allocator that this is such an allocation at
      free_pages() time.  This is done by the invocation of
      __free_accounted_pages() and free_accounted_pages().
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a1a0d3b
    • memcg: kmem controller infrastructure · 7ae1e1d0
      Glauber Costa committed
      Introduce infrastructure for tracking kernel memory pages to a given
      memcg.  This will happen whenever the caller includes the
      __GFP_KMEMCG flag and the task belongs to a memcg other than the root.

      In memcontrol.h those functions are wrapped in inline accessors.  The
      idea is to later patch those with static branches, so we don't incur any
      overhead when no mem cgroups with limited kmem are being used.
      
      Users of this functionality shall interact with the memcg core code
      through the following functions:
      
      memcg_kmem_newpage_charge: will return true if the group can handle the
                                 allocation. At this point, struct page is not
                                 yet allocated.
      
      memcg_kmem_commit_charge: will either revert the charge, if struct page
                                allocation failed, or embed memcg information
                                into page_cgroup.
      
      memcg_kmem_uncharge_page: called at free time, will revert the charge.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ae1e1d0
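The three entry points listed in 7ae1e1d0 form a reserve / commit-or-revert / release protocol. A minimal userspace model, assuming hypothetical `toy_group` and `toy_page` types in place of the memcg and page_cgroup and plain byte sizes in place of the gfp mask and order the real functions work with, might look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_group { unsigned long long usage, limit; };
struct toy_page  { struct toy_group *owner; };

/* Phase 1: reserve room before the page exists; true if we may proceed. */
static bool toy_newpage_charge(struct toy_group *g, unsigned long long size)
{
    if (g->usage + size > g->limit)
        return false;
    g->usage += size;
    return true;
}

/* Phase 2: a NULL page means the allocation failed, so revert the charge;
 * otherwise remember which group the page was charged to. */
static void toy_commit_charge(struct toy_group *g, struct toy_page *page,
                              unsigned long long size)
{
    if (!page) {
        g->usage -= size;
        return;
    }
    page->owner = g;
}

/* Phase 3: at free time, give the charge back to the recorded owner. */
static void toy_uncharge_page(struct toy_page *page, unsigned long long size)
{
    if (!page->owner)
        return; /* page was never accounted */
    page->owner->usage -= size;
    page->owner = NULL;
}
```

Splitting the charge from the commit lets the limit check happen before any page is allocated, while the owner can only be recorded once a page actually exists.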
    • mm: add a __GFP_KMEMCG flag · 7a64bf05
      Glauber Costa committed
      This flag is used to indicate to the callees that this allocation is a
      kernel allocation in process context, and should be accounted to current's
      memcg.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a64bf05
    • memcg: kmem accounting basic infrastructure · 510fc4e1
      Glauber Costa committed
      Add the basic infrastructure for the accounting of kernel memory.  To
      control that, the following files are created:
      
       * memory.kmem.usage_in_bytes
       * memory.kmem.limit_in_bytes
       * memory.kmem.failcnt
       * memory.kmem.max_usage_in_bytes
      
      They have the same meaning as their user memory counterparts.  They
      reflect the state of the "kmem" res_counter.
      
      Per cgroup kmem memory accounting is not enabled until a limit is set for
      the group.  Once the limit is set the accounting cannot be disabled for
      that group.  This means that after the patch is applied, no behavioral
      changes exist for whoever is still using memcg to control their memory
      usage, until memory.kmem.limit_in_bytes is set for the first time.
      
      We always account to both user and kernel resource_counters.  This
      effectively means that an independent kernel limit is in place when the
      limit is set to a lower value than the user memory.  An equal or higher
      value means that the user limit will always hit first, meaning that kmem
      is effectively unlimited.
      
      People who want to track kernel memory but not limit it can set this
      limit to a very high number (like RESOURCE_MAX - 1 page, which no one
      will ever hit, or equal to the user memory).
      
      [akpm@linux-foundation.org: MEMCG_KMEM only works with slab and slub]
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      510fc4e1
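The "account to both counters" rule in 510fc4e1 explains why a kmem limit only bites when it is lower than the user limit. A toy model, with hypothetical `toy_res` and `toy_memcg_counters` types standing in for the two res_counters, and with the kmem-first charge order being an assumption of this sketch:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_res { unsigned long long usage, limit, failcnt; };

/* Charge one counter; count a failure if the limit would be exceeded. */
static bool toy_res_charge(struct toy_res *r, unsigned long long size)
{
    if (r->usage + size > r->limit) {
        r->failcnt++;
        return false;
    }
    r->usage += size;
    return true;
}

struct toy_memcg_counters {
    struct toy_res user; /* models the user memory counter */
    struct toy_res kmem; /* models the "kmem" counter */
};

/* A kernel allocation is charged to both counters; if the user counter
 * refuses, the kmem charge is rolled back so the two stay consistent. */
static bool toy_charge_kmem(struct toy_memcg_counters *m,
                            unsigned long long size)
{
    if (!toy_res_charge(&m->kmem, size))
        return false;
    if (!toy_res_charge(&m->user, size)) {
        m->kmem.usage -= size;
        return false;
    }
    return true;
}
```

With a kmem limit of 40 and a user limit of 100, the second 30-unit kernel charge fails on the kmem counter; with the kmem limit raised well above the user limit, the same sequence fails on the user counter instead, so the kmem limit never becomes the binding one.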
    • memcg: change defines to an enum · 86ae53e1
      Glauber Costa committed
      This is just a cleanup patch for clarity of expression.  In earlier
      submissions, people asked it to be in a separate patch, so here it is.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86ae53e1
    • memcg: reclaim when more than one page needed · 4c9c5359
      Suleiman Souhlal committed
      mem_cgroup_do_charge() was written before kmem accounting, and expects
      three cases: being called for 1 page, being called for a stock of 32
      pages, or being called for a hugepage.  If we call for 2 or 3 pages (and
      both the stack and several slabs used in process creation are such, at
      least with the debug options I had), it assumes it is being called for
      stock and just retries without reclaiming.
      
      Fix that by passing down a minsize argument in addition to the csize.
      
      And what to do about that (csize == PAGE_SIZE && ret) retry?  If it's
      needed at all (and presumably is since it's there, perhaps to handle
      races), then it should be extended to more than PAGE_SIZE, yet how far?
      And should there be a retry count limit, of what?  For now retry up to
      COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
      __GFP_NORETRY.
      
      v4: fixed nr pages calculation pointed out by Christoph Lameter.
      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c9c5359
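The retry rule that 4c9c5359 settles on (retry up to COSTLY_ORDER, never with __GFP_NORETRY) reduces to a small predicate. The sketch below is an interpretation, not the patch itself: `toy_should_retry` is a hypothetical helper, `noretry` models the __GFP_NORETRY flag, and `reclaimed` models reclaim having made some progress.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3 /* same value the kernel page allocator uses */

/*
 * Decide whether a failed charge of nr_pages is worth retrying after
 * reclaim.  Requests larger than 2^COSTLY_ORDER pages are not retried.
 */
static bool toy_should_retry(unsigned int nr_pages, bool noretry,
                             bool reclaimed)
{
    if (noretry)
        return false;
    return reclaimed && nr_pages <= (1u << PAGE_ALLOC_COSTLY_ORDER);
}
```

Under this rule a 2-page stack allocation is retried after successful reclaim, while a 32-page stock refill is not.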
    • memcg: make it possible to use the stock for more than one page · a0956d54
      Suleiman Souhlal committed
      We currently have a percpu stock cache scheme that charges one page at a
      time from memcg->res, the user counter.  When the kernel memory controller
      comes into play, we'll need to charge more than that.
      
      This is because kernel memory allocations will also draw from the user
      counter, and can be bigger than a single page, as is the case with the
      stack (usually 2 pages) or some higher order slabs.
      
      [glommer@parallels.com: added a changelog ]
      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0956d54
    • memory-hotplug: document and enable CONFIG_MOVABLE_NODE · c2974058
      Tang Chen committed
      Add help info for CONFIG_MOVABLE_NODE and permit its selection.
      
      This option allows the user to online all memory of a node as movable
      memory, so that the whole node can be hotplugged.  Users who don't use
      the hotplug feature are also fine with this option on, since they won't
      online memory as movable.
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      [akpm@linux-foundation.org: tweak help text]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2974058
    • mm/page_alloc.c: remove duplicate check · 0bb2c763
      Gavin Shan committed
      While allocating pages using buddy allocator, the compound page is
      probably split up to free pages.  Under these circumstances, the compound
      page should be destroyed by destroy_compound_page().  However, there is a
      duplicate check to judge if the page is compound.
      
      Remove the duplicate check since the compound_order() returns 0 when the
      page doesn't have PG_head set in destroy_compound_page().  That is to say,
      destroy_compound_page() needn't check PageHead().
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
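The reasoning in the page_alloc commit above can be illustrated with a small userspace model. This is only a sketch: `struct toy_page`, `toy_compound_order()` and `toy_is_bad_compound()` are hypothetical stand-ins, not the kernel's definitions, and the argument assumes the expected order is non-zero.

```c
#include <stdbool.h>

/* Toy stand-in for struct page; not the kernel's layout. */
struct toy_page {
    bool head;            /* models the PG_head flag */
    unsigned long order;  /* only meaningful when head is set */
};

/* Models the behaviour the commit relies on: 0 when PG_head is not set. */
static unsigned long toy_compound_order(const struct toy_page *page)
{
    if (!page->head)
        return 0;
    return page->order;
}

/*
 * True when the page does not look like a compound page of the expected
 * order.  For order > 0, a page without the head flag yields 0 != order,
 * so a separate "is it a head page?" test adds nothing.
 */
static bool toy_is_bad_compound(const struct toy_page *page, unsigned long order)
{
    return toy_compound_order(page) != order;
}
```

In this model a non-head page is already rejected by the order comparison alone, which is why the extra head check can be considered redundant.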
    • A
      drivers/message/fusion/mptscsih.c: missing break · 3012d60b
      Alan Cox committed
      This happens to do the right thing in all cases on fibre channel but not on
      other media types.
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com>
      Cc: Kashyap Desai <kashyap.desai@lsi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
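For readers unfamiliar with this class of bug, here is a generic sketch of how a missing break can be harmless for one case and wrong for another. The enum, the function names and the numeric values are all hypothetical; this is not the mptscsih code.

```c
/* Hypothetical media types, for illustration only. */
enum toy_bus { TOY_FC, TOY_SAS, TOY_SPI };

/* Buggy: TOY_SAS falls through and its value is overwritten. */
static int toy_timeout_buggy(enum toy_bus bus)
{
    int timeout = 0;

    switch (bus) {
    case TOY_FC:
        timeout = 40;
        break;
    case TOY_SAS:
        timeout = 30;
        /* missing break: execution continues into the next case */
    case TOY_SPI:
    default:
        timeout = 10;
        break;
    }
    return timeout;
}

/* Fixed: every case ends with a break. */
static int toy_timeout_fixed(enum toy_bus bus)
{
    int timeout = 0;

    switch (bus) {
    case TOY_FC:
        timeout = 40;
        break;
    case TOY_SAS:
        timeout = 30;
        break;
    case TOY_SPI:
    default:
        timeout = 10;
        break;
    }
    return timeout;
}
```

In this toy, the first case gives the same answer in both versions, while the second case only gets its intended value once the break is added.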
    • F
      h8300: select generic atomic64_t support · d95bfe46
      Fengguang Wu committed
      Rationale from Eric:
      
      So I just looked a little deeper and it appears architectures that do
      not support atomic64_t are broken.
      
      The generic atomic64 support came in 2009 to support the perf subsystem
      with the expectation that all architectures would implement atomic64
      support.
      
      Furthermore upon inspection of the kernel atomic64_t is used in a fair
      number of places beyond the performance counters:
      
      block/blk-cgroup.c
      drivers/acpi/apei/
      drivers/block/rbd.c
      drivers/crypto/nx/nx.h
      drivers/gpu/drm/radeon/radeon.h
      drivers/infiniband/hw/ipath/
      drivers/infiniband/hw/qib/
      drivers/staging/octeon/
      fs/xfs/
      include/linux/perf_event.h
      include/net/netfilter/nf_conntrack_acct.h
      kernel/events/
      kernel/trace/
      net/mac80211/key.h
      net/rds/
      
      The block control group, infiniband, xfs, crypto, 802.11, netfilter.
      Nothing quite so fundamental as fs/namespace.c but definitely in
      multiplatform code that should work, and is already broken on those
      architectures.
      
      Looking at the implementation of atomic64_add_return in lib/atomic64.c the
      code looks as efficient as these kinds of things get.
      
      Which leads me to the conclusion that we need atomic64 support on all
      architectures.
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
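To give a feel for what a generic, lock-based 64-bit atomic looks like on an architecture without native support, here is a minimal userspace sketch. It is not the code in lib/atomic64.c: `toy_atomic64_t`, `toy_lock` and `toy_atomic64_add_return()` are hypothetical names, a single global C11 spinlock is used as a simplification, and kernel concerns such as interrupt handling are ignored.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit counter protected by a lock rather than by hardware. */
typedef struct {
    int64_t counter;
} toy_atomic64_t;

/* One global spinlock for all counters (a deliberate simplification). */
static atomic_flag toy_lock = ATOMIC_FLAG_INIT;

/* Add a to *v under the lock and return the new value. */
static int64_t toy_atomic64_add_return(int64_t a, toy_atomic64_t *v)
{
    int64_t val;

    while (atomic_flag_test_and_set_explicit(&toy_lock, memory_order_acquire))
        ;  /* spin until the lock is free */
    v->counter += a;
    val = v->counter;
    atomic_flag_clear_explicit(&toy_lock, memory_order_release);
    return val;
}
```

The critical section is just a load, an add and a store, which is in line with the message's remark that there is little room to make this kind of code more efficient.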
    • C
      Coccinelle: add api/d_find_alias.cocci · af56e3f0
      Cyril Roelandt committed
      Ensure that calls to d_find_alias() have a corresponding dput().
      Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Gilles Muller <Gilles.Muller@lip6.fr>
      Cc: Nicolas Palix <nicolas.palix@imag.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      irq: tsk->comm is an array · 19af395d
      Alan Cox committed
      The array check is useless so remove it.
      
      [akpm@linux-foundation.org: remove comment, per David]
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
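Why a NULL check on an array member is useless can be shown with a small model. `struct toy_task`, `TOY_COMM_LEN` and `toy_task_name()` are hypothetical; the only assumption carried over from the commit title is that the name field is an embedded array rather than a pointer.

```c
#include <stddef.h>

#define TOY_COMM_LEN 16

/* Toy task with the name stored inline, as an array member. */
struct toy_task {
    int pid;
    char comm[TOY_COMM_LEN];
};

/*
 * tsk->comm decays to the address of storage inside *tsk, so it can
 * never be NULL.  An expression such as
 *     tsk->comm ? tsk->comm : ""
 * therefore always takes the first branch, and the check can go.
 */
static const char *toy_task_name(const struct toy_task *tsk)
{
    return tsk->comm;
}
```

An unnamed task still yields a valid empty string, because zero-initialised array storage starts with a terminating NUL byte.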
    • C
      ceph: fix dentry reference leak in ceph_encode_fh() · f6af75da
      Cyril Roelandt committed
      dput() was not called in the error path.
      Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
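The kind of leak described here, a reference taken but not released on an error path, can be sketched with a toy reference count. Everything below is hypothetical (`struct toy_dentry`, `toy_get()`, `toy_put()`, `toy_encode()`); it models the get/put pairing only and is not the ceph code.

```c
/* Toy reference-counted object standing in for a dentry. */
struct toy_dentry {
    int refcount;
};

static struct toy_dentry *toy_get(struct toy_dentry *d)
{
    d->refcount++;
    return d;
}

static void toy_put(struct toy_dentry *d)
{
    d->refcount--;
}

/*
 * Returns 0 on success and -1 on failure.  Both paths leave through
 * the "out" label, so the reference taken at the top is always dropped.
 * An early "return -1" in the error branch would leak it instead.
 */
static int toy_encode(struct toy_dentry *d, int buf_len)
{
    struct toy_dentry *ref = toy_get(d);
    int err = 0;

    if (buf_len < 4) {
        err = -1;
        goto out;
    }
    /* ... the success path would use ref here ... */
out:
    toy_put(ref);
    return err;
}
```

Routing every exit through one cleanup label is a common way to keep such pairs balanced; after either outcome the count is back where it started.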
    • S
      arch/x86/platform/iris/iris.c: register a platform device and a platform driver · 88d67ee3
      Shérab committed
      This makes the iris driver use the platform API, so it is properly exposed
      in /sys.
      
      [akpm@linux-foundation.org: remove commented-out code, add missing space to printk, clean up code layout]
      Signed-off-by: Shérab <Sebastien.Hinderer@ens-lyon.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Matthew Garrett <mjg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      CRIS: fix I/O macros · c24bf9b4
      Corey Minyard committed
      The inb/outb macros for CRIS are broken in several ways: they are
      missing () around parameters and they have an unprotected if statement
      in them.  This was breaking the compile of IPMI on CRIS and thus I was
      being annoyed by build regressions, so I fixed them.
      
      Plus I don't think they would have worked at all, since the data values
      were missing "&" and the outsl had a "3" instead of a "4" for the size.
      From what I can tell, this stuff is not used at all, so this can't be
      any more broken than it was before, anyway.
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
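The two macro hazards named in this message, missing parentheses around parameters and an unprotected if statement, can be reproduced in isolation. The macros below are hypothetical illustrations and are unrelated to the actual CRIS I/O definitions.

```c
/* Hazard 1: an unparenthesised parameter changes operator precedence. */
#define TOY_DOUBLE_BAD(x)  (x * 2)
#define TOY_DOUBLE_GOOD(x) ((x) * 2)

/* Hazard 2: a bare if in a statement macro captures a following else. */
#define TOY_SET_IF_BAD(cond, var)  if (cond) (var) = 1
#define TOY_SET_IF_GOOD(cond, var) do { if (cond) (var) = 1; } while (0)

/* The else below binds to the if hidden inside the macro. */
static int toy_classify_bad(int enabled)
{
    int hit = 0, other = 0;

    if (enabled)
        TOY_SET_IF_BAD(0, hit);
    else
        other = 1;
    (void)hit;
    return other;
}

/* With the do/while(0) wrapper the else binds to "if (enabled)". */
static int toy_classify_good(int enabled)
{
    int hit = 0, other = 0;

    if (enabled)
        TOY_SET_IF_GOOD(0, hit);
    else
        other = 1;
    (void)hit;
    return other;
}
```

With the bad macros, `TOY_DOUBLE_BAD(1 + 2)` expands to `(1 + 2 * 2)`, and `toy_classify_bad()` runs its else branch exactly when `enabled` is true, the opposite of what the indentation suggests.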
    • J
      backlight: locomolcd: fix checkpatch error and warning · 9f67675a
      Jingoo Han committed
      This patch fixes the checkpatch error and warning as below:
      
        WARNING: space prohibited between function name and open parenthesis '('
        ERROR: trailing statements should be on next line
      
      Also, long comments are fixed for the preferred style and unnecessary
      lines are removed.
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux · ae664dba
      Linus Torvalds committed
      Pull SLAB changes from Pekka Enberg:
       "This contains preparational work from Christoph Lameter and Glauber
        Costa for SLAB memcg and cleanups and improvements from Ezequiel
        Garcia and Joonsoo Kim.
      
        Please note that the SLOB cleanup commit from Arnd Bergmann already
        appears in your tree but I had also merged it myself which is why it
        shows up in the shortlog."
      
      * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
        mm/sl[aou]b: Common alignment code
        slab: Use the new create_boot_cache function to simplify bootstrap
        slub: Use statically allocated kmem_cache boot structure for bootstrap
        mm, sl[au]b: create common functions for boot slab creation
        slab: Simplify bootstrap
        slub: Use correct cpu_slab on dead cpu
        mm: fix slab.c kernel-doc warnings
        mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN
        slab: Ignore internal flags in cache creation
        mm/slob: Use free_page instead of put_page for page-size kmalloc allocations
        mm/sl[aou]b: Move common kmem_cache_size() to slab.h
        mm/slob: Use object_size field in kmem_cache_size()
        mm/slob: Drop usage of page->private for storing page-sized allocations
        slub: Commonize slab_cache field in struct page
        sl[au]b: Process slabinfo_show in common code
        mm/sl[au]b: Move print_slabinfo_header to slab_common.c
        mm/sl[au]b: Move slabinfo processing to slab_common.c
        slub: remove one code path and reduce lock contention in __slab_free()
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace · a2faf2fc
      Linus Torvalds committed
      Pull (again) user namespace infrastructure changes from Eric Biederman:
       "Those bugs, those darn embarrassing bugs just don't want to get
        fixed.
      
        Linus I just updated my mirror of your kernel.org tree and it appears
        you successfully pulled everything except the last 4 commits that fix
        those embarrassing bugs.
      
        When you get a chance can you please repull my branch"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
        userns: Fix typo in description of the limitation of userns_install
        userns: Add a more complete capability subset test to commit_creds
        userns: Require CAP_SYS_ADMIN for most uses of setns.
        Fix cap_capable to only allow owners in the parent user namespace to have caps.
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lliubbo/blackfin · 4351654e
      Linus Torvalds committed
      Pull blackfin update from Bob Liu.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lliubbo/blackfin:
        blackfin: SEC: clean up SEC interrupt initialization
        blackfin: kgdb: call generic_exec_single() directly
        blackfin: anomaly: add anomaly 16000030 for bf5xx
        Blackfin: dpmc: use module_platform_driver macro
        Blackfin: remove unused is_in_rom()
        Blackfin: remove unnecessary prototype for kobjsize()
        Blackfin: twi: Add missing __iomem annotation
        Blackfin: Annotate strnlen_user and strlen_user 'src' parameter with __user
        Blackfin: Annotate clear_user 'to' parameter with __user
        Blackfin: Add missing __user annotations to put_user
        Blackfin: Annotate strncpy_from_user src parameter with __user
        blackfin: Use Kbuild infrastructure for kvm_para.h
        UAPI: (Scripted) Disintegrate arch/blackfin/include/asm
    • L
      Merge tag 'disintegrate-alpha-20121217' of git://git.infradead.org/users/dhowells/linux-headers · 3d9de190
      Linus Torvalds committed
      Pull UAPI disintegration for Alpha from David Howells:
       "I've been asked to send the Alpha UAPI disintegration to you directly.
        The acks I have been given have been added into the patch."
      
      * tag 'disintegrate-alpha-20121217' of git://git.infradead.org/users/dhowells/linux-headers:
        UAPI: (Scripted) Disintegrate arch/alpha/include/asm
    • L
      Merge tag 'for-3.8' of git://openrisc.net/~jonas/linux · 9a8a5702
      Linus Torvalds committed
      Pull OpenRISC update from Jonas Bonn:
       "Trivial cleanups for OpenRISC."
      
      * tag 'for-3.8' of git://openrisc.net/~jonas/linux:
        openrisc: use kbuild.h instead of defining macros in asm-offset.c
        openrisc: Use Kbuild infrastructure for kvm_para.h
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 7b077868
      Linus Torvalds committed
      Pull s390 update #2 from Martin Schwidefsky:
       "The main patch is the function measurement blocks extension for PCI to
        do performance statistics and help with debugging.  The other patch is
        a small cleanup in ccwdev.h."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/ccwdev: Include asm/schid.h.
        s390/pci: performance statistics and debug infrastructure
    • L
      Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc · 16e024f3
      Linus Torvalds committed
      Pull powerpc update from Benjamin Herrenschmidt:
       "The main highlight is probably some base POWER8 support.  There's more
        to come such as transactional memory support but that will wait for
        the next one.
      
        Overall it's pretty quiet, or rather I've been pretty poor at picking
        things up from patchwork and reviewing them this time around and Kumar
        no better on the FSL side it seems..."
      
      * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (73 commits)
        powerpc+of: Rename and fix OF reconfig notifier error inject module
        powerpc: mpc5200: Add a3m071 board support
        powerpc/512x: don't compile any platform DIU code if the DIU is not enabled
        powerpc/mpc52xx: use module_platform_driver macro
        powerpc+of: Export of_reconfig_notifier_[register,unregister]
        powerpc/dma/raidengine: add raidengine device
        powerpc/iommu/fsl: Add PAMU bypass enable register to ccsr_guts struct
        powerpc/mpc85xx: Change spin table to cached memory
        powerpc/fsl-pci: Add PCI controller ATMU PM support
        powerpc/86xx: fsl_pcibios_fixup_bus requires CONFIG_PCI
        drivers/virt: the Freescale hypervisor driver doesn't need to check MSR[GS]
        powerpc/85xx: p1022ds: Use NULL instead of 0 for pointers
        powerpc: Disable relocation on exceptions when kexecing
        powerpc: Enable relocation on during exceptions at boot
        powerpc: Move get_longbusy_msecs into hvcall.h and remove duplicate function
        powerpc: Add wrappers to enable/disable relocation on exceptions
        powerpc: Add set_mode hcall
        powerpc: Setup relocation on exceptions for bare metal systems
        powerpc: Move initial mfspr LPCR out of __init_LPCR
        powerpc: Add relocation on exception vector handlers
        ...
    • D
      x86, paravirt: fix build error when thp is disabled · c36e0501
      David Rientjes committed
      With CONFIG_PARAVIRT=y and CONFIG_TRANSPARENT_HUGEPAGE=n, the build breaks
      because set_pmd_at() is undeclared:
      
        mm/memory.c: In function 'do_pmd_numa_page':
        mm/memory.c:3520: error: implicit declaration of function 'set_pmd_at'
        mm/mprotect.c: In function 'change_pmd_protnuma':
        mm/mprotect.c:120: error: implicit declaration of function 'set_pmd_at'
      
      This is because paravirt defines set_pmd_at() only when
      CONFIG_TRANSPARENT_HUGEPAGE=y and such a restriction is unneeded.  The
      fix is to define it for all CONFIG_PARAVIRT configurations.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd · ea77d73c
      Linus Torvalds committed
      Pull exofs changes from Boaz Harrosh:
       "These are just 3 patches, the last two are bug fixes on the error
        paths in exofs.
      
        The important patch is the one to osd_uld which adds sysfs info to osd
        devices for use by user-mode clustering discovery software.  I have
        been sitting on this patch since before February this year.  It is
        important for some of the big cluster installations, which have been
        compiling their own kernels just for that patch."
      
      Ugh.  The osd_uld patch already went through the SCSI tree, so this was
      kind of pointless.  But at least it has the two small error-path fixes..
      
      * 'for-linus' of git://git.open-osd.org/linux-open-osd:
        exofs: don't leak io_state and pages on read error
        exofs: clean up the correct page collection on write error
        osduld: Add osdname & systemid sysfs at scsi_osd class
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · a22180d2
      Linus Torvalds committed
      Pull btrfs update from Chris Mason:
       "A big set of fixes and features.
      
        In terms of line count, most of the code comes from Stefan, who added
        the ability to replace a single drive in place.  This is different
        from how btrfs normally replaces drives, and is much much much faster.
      
        Josef is plowing through our synchronous write performance.  This pull
        request does not include the DIO_OWN_WAITING patch that was discussed
        on the list, but it has a number of other improvements to cut down our
        latencies and CPU time during fsync/O_DIRECT writes.
      
        Miao Xie has a big series of fixes and is spreading out ordered
        operations over more CPUs.  This improves performance and reduces
        contention.
      
        I've put in fixes for error handling around hash collisions.  These
        are going back to individual stable kernels as I test against them.
      
        Otherwise we have a lot of fixes and cleanups, thanks everyone!
        raid5/6 is being rebased against the device replacement code.  I'll
        have it posted this Friday along with a nice series of benchmarks."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
        Btrfs: fix a bug of per-file nocow
        Btrfs: fix hash overflow handling
        Btrfs: don't take inode delalloc mutex if we're a free space inode
        Btrfs: fix autodefrag and umount lockup
        Btrfs: fix permissions of empty files not affected by umask
        Btrfs: put raid properties into global table
        Btrfs: fix BUG() in scrub when first superblock reading gives EIO
        Btrfs: do not call file_update_time in aio_write
        Btrfs: only unlock and relock if we have to
        Btrfs: use tokens where we can in the tree log
        Btrfs: optimize leaf_space_used
        Btrfs: don't memset new tokens
        Btrfs: only clear dirty on the buffer if it is marked as dirty
        Btrfs: move checks in set_page_dirty under DEBUG
        Btrfs: log changed inodes based on the extent map tree
        Btrfs: add path->really_keep_locks
        Btrfs: do not mark ems as prealloc if we are writing to them
        Btrfs: keep track of the extents original block length
        Btrfs: inline csums if we're fsyncing
        Btrfs: don't bother copying if we're only logging the inode
        ...
    • L
      Merge tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 2d4dce00
      Linus Torvalds committed
      Pull NFS client updates from Trond Myklebust:
       "Features include:
      
         - Full audit of BUG_ON asserts in the NFS, SUNRPC and lockd client
           code.  Remove altogether where possible, and replace with
           WARN_ON_ONCE and appropriate error returns where not.
         - NFSv4.1 client adds session dynamic slot table management.  There
           is matching server side code that has been submitted to Bruce for
           consideration.
      
           Together, this code allows the server to dynamically manage the
           amount of memory it allocates to the duplicate request cache for
           each client.  It will constantly resize those caches to reserve
           more memory for clients that are hot while shrinking caches for
           those that are quiescent.
      
        In addition, there are assorted bugfixes for the generic NFS write
        code, fixes to deal with the drop_nlink() warnings, and yet another
        fix for NFSv4 getacl."
      
      * tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (106 commits)
        SUNRPC: continue run over clients list on PipeFS event instead of break
        NFS: Don't use SetPageError in the NFS writeback code
        SUNRPC: variable 'svsk' is unused in function bc_send_request
        SUNRPC: Handle ECONNREFUSED in xs_local_setup_socket
        NFSv4.1: Deal effectively with interrupted RPC calls.
        NFSv4.1: Move the RPC timestamp out of the slot.
        NFSv4.1: Try to deal with NFS4ERR_SEQ_MISORDERED.
        NFS: nfs_lookup_revalidate should not trust an inode with i_nlink == 0
        NFS: Fix calls to drop_nlink()
        NFS: Ensure that we always drop inodes that have been marked as stale
        nfs: Remove unused list nfs4_clientid_list
        nfs: Remove duplicate function declaration in internal.h
        NFS: avoid NULL dereference in nfs_destroy_server
        SUNRPC handle EKEYEXPIRED in call_refreshresult
        SUNRPC set gss gc_expiry to full lifetime
        nfs: fix page dirtying in NFS DIO read codepath
        nfs: don't zero out the rest of the page if we hit the EOF on a DIO READ
        NFSv4.1: Be conservative about the client highest slotid
        NFSv4.1: Handle NFS4ERR_BADSLOT errors correctly
        nfs: don't extend writes to cover entire page if pagecache is invalid
        ...
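The NFS pull message above mentions replacing BUG_ON asserts with WARN_ON_ONCE and an appropriate error return. The pattern can be imitated in userspace as follows. This is a sketch under assumptions: `TOY_WARN_ON_ONCE()` and `toy_slot_lookup()` are hypothetical, the macro relies on the GNU C statement-expression extension (GCC or Clang), and none of this is taken from the NFS sources.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Userspace imitation of a warn-once check: evaluates to the truth
 * value of cond and prints a message the first time it is true at
 * this call site.  Uses a GNU C statement expression.
 */
#define TOY_WARN_ON_ONCE(cond) ({                                   \
    static int toy_warned_;                                         \
    int toy_c_ = !!(cond);                                          \
    if (toy_c_ && !toy_warned_) {                                   \
        toy_warned_ = 1;                                            \
        fprintf(stderr, "warning: %s at %s:%d\n",                   \
                #cond, __FILE__, __LINE__);                         \
    }                                                               \
    toy_c_;                                                         \
})

/*
 * Instead of halting on an impossible slot number, warn once and
 * hand the caller an error it can propagate.
 */
static int toy_slot_lookup(int slot, int nr_slots)
{
    if (TOY_WARN_ON_ONCE(slot < 0 || slot >= nr_slots))
        return -EINVAL;
    return slot;
}
```

The design trade-off is that the caller must now handle the error, but a violated assumption no longer takes the whole system down.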
    • L
      Merge tag 'md-3.8' of git://neil.brown.name/md · ea88eeac
      Linus Torvalds committed
      Pull md update from Neil Brown:
       "Mostly just little fixes.  Probably biggest part is AVX accelerated
        RAID6 calculations."
      
      * tag 'md-3.8' of git://neil.brown.name/md:
        md/raid5: add blktrace calls
        md/raid5: use async_tx_quiesce() instead of open-coding it.
        md: Use ->curr_resync as last completed request when cleanly aborting resync.
        lib/raid6: build proper files on corresponding arch
        lib/raid6: Add AVX2 optimized gen_syndrome functions
        lib/raid6: Add AVX2 optimized recovery functions
        md: Update checkpoint of resync/recovery based on time.
        md:Add place to update ->recovery_cp.
        md.c: re-indent various 'switch' statements.
        md: close race between removing and adding a device.
        md: removed unused variable in calc_sb_1_csm.