1. 19 Dec, 2012 21 commits
    • blk: avoid divide-by-zero with zero discard granularity · 59771079
      Linus Torvalds committed
      Commit 8dd2cb7e ("block: discard granularity might not be power of
      2") changed a couple of 'binary and' operations into modulus operations,
      which turned the harmless case of a zero discard_granularity into a
      possible divide-by-zero.
      
      The code also had a much more subtle bug: it was doing the modulus of a
      value in bytes using 'sector_t'.  That was always conceptually wrong,
      but didn't actually matter back when the code assumed a power-of-two
      granularity: we only looked at the low bits anyway.
      
      But with potentially arbitrary sector numbers, using a 'sector_t' to
      express bytes is very very wrong: depending on configuration it limits
      the starting offset of the device to just 32 bits, and any overflow
      would result in a wrong value if the modulus wasn't a power-of-two.
      
      So re-write the code to not only protect against the divide-by-zero, but
      to do the starting sector arithmetic in sectors, and using the proper
      types.
      
      [ For any mathematicians out there: it also looks monumentally stupid to
        do the 'modulo granularity' operation *twice*, never mind having a "+
        granularity" in the second modulus op.
      
        But that's the easiest way to avoid negative values or overflow, and
        it is how the original code was done. ]
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Reported-by: Doug Anderson <dianders@chromium.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
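The arithmetic described in the commit above can be sketched in plain C. This is a minimal illustration, not the kernel's code: the function name `discard_misalignment`, its parameters, and the `SECTOR_SHIFT` constant are assumptions made for the example. It works in 512-byte sectors with a 64-bit type, returns early on a zero granularity, and uses the "modulo twice, with + granularity" form so the unsigned subtraction cannot wrap.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9 /* assumption: 512-byte sectors */

/*
 * Hypothetical helper: number of sectors from start_sector to the next
 * sector that satisfies the discard alignment. Limits are given in bytes,
 * the start position in sectors.
 */
static uint64_t discard_misalignment(uint64_t start_sector,
                                     uint32_t alignment_bytes,
                                     uint32_t granularity_bytes)
{
    uint64_t granularity = granularity_bytes >> SECTOR_SHIFT;
    uint64_t alignment = alignment_bytes >> SECTOR_SHIFT;
    uint64_t offset;

    /* Guard against the divide-by-zero: no granularity, nothing to align. */
    if (granularity == 0)
        return 0;

    /* First modulo: position of the start sector inside a granule. */
    offset = start_sector % granularity;

    /*
     * Second modulo: offset < granularity, so adding granularity first
     * keeps the unsigned expression from going below zero.
     */
    return (granularity + alignment - offset) % granularity;
}
```

With a granularity of 1536 bytes (3 sectors) and no alignment offset, a start sector of 4 is 2 sectors short of the next aligned sector (6).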
    • mm/hugetlb: create hugetlb cgroup file in hugetlb_init · 7179e7bf
      Jianguo Wu committed
      When the kernel is built with CONFIG_HUGETLBFS=y, CONFIG_HUGETLB_PAGE=y and
      CONFIG_CGROUP_HUGETLB=y and the hugepagesz=xx boot option is specified, the
      system fails to boot.
      
      The failure is caused by the following code path:
      
        setup_hugepagesz
          hugetlb_add_hstate
            hugetlb_cgroup_file_init
              cgroup_add_cftypes
                kzalloc <--slab is *not available* yet
      
      On this path slab is not available yet, so the allocation fails and
      triggers the WARN_ON() in hugetlb_cgroup_file_init().
      
      So move hugetlb_cgroup_file_init() into hugetlb_init().
      
      [akpm@linux-foundation.org: tweak coding-style, remove pointless __init on inlined function]
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: add comments clarifying aspects of cache attribute propagation · ebe945c2
      Glauber Costa committed
      This patch clarifies two aspects of cache attribute propagation.
      
      First, the expected context for the for_each_memcg_cache macro in
      memcontrol.h.  The usages already in the codebase are safe.  In mm/slub.c,
      it is trivially safe because the lock is acquired right before the loop.
      In mm/slab.c, it is less so: the lock is acquired by an outer function a
      few steps back in the stack, so a VM_BUG_ON() is added to make sure it is
      indeed safe.
      
      A comment is also added to detail why we are returning the value of the
      parent cache and ignoring the children's when we propagate the attributes.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: slub-specific propagation changes · 107dab5c
      Glauber Costa committed
      SLUB allows us to tune a particular cache behavior with sysfs-based
      tunables.  When creating a new memcg cache copy, we'd like to preserve any
      tunables the parent cache already had.
      
      This can be done by tapping into the store attribute function provided by
      the allocator.  We of course don't need to mess with read-only fields.
      Since the attributes can have multiple types and are stored internally by
      sysfs, the best strategy is to issue a ->show() in the root cache, and
      then ->store() in the memcg cache.
      
      The drawback is that sysfs can allocate up to a page of buffering for
      show(), which we are unlikely to need but cannot rule out.  To avoid
      always allocating a page for that, we can update the caches at store
      time with the maximum attribute size ever stored to the root cache.  We
      will then get a buffer big enough to hold it.  The corollary is that if
      no stores happened, nothing will be propagated.
      
      It can also happen that a root cache has its tunables updated during
      normal system operation.  In this case, we will propagate the change to
      all caches that are already active.
      
      [akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
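The show-then-store propagation described in the commit above can be modelled in user space. The sketch below is a toy, not the SLUB code: `struct toy_cache`, `toy_show`, `toy_store` and `toy_propagate` are hypothetical names, and a single integer tunable stands in for the real sysfs attributes. It shows the two properties the message calls out: the buffer is sized from the largest value ever stored to the root cache, and nothing is propagated if no store ever happened.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy cache with one tunable; all names here are hypothetical. */
struct toy_cache {
    int min_partial;      /* the tunable being propagated */
    size_t max_attr_size; /* longest string ever stored (tracked on root) */
};

/* "show": format the tunable into a caller-supplied buffer. */
static int toy_show(const struct toy_cache *c, char *buf, size_t len)
{
    return snprintf(buf, len, "%d", c->min_partial);
}

/* "store": parse the tunable and remember the largest input seen. */
static void toy_store(struct toy_cache *c, const char *buf)
{
    size_t len = strlen(buf);

    c->min_partial = atoi(buf);
    if (len > c->max_attr_size)
        c->max_attr_size = len;
}

/*
 * Copy the root's tunable into a child cache.
 * Returns 1 if propagated, 0 if nothing was ever stored to the root,
 * -1 if the buffer could not be allocated.
 */
static int toy_propagate(const struct toy_cache *root, struct toy_cache *child)
{
    char *buf;

    if (root->max_attr_size == 0)
        return 0; /* no stores happened, so nothing to propagate */

    buf = malloc(root->max_attr_size + 1);
    if (!buf)
        return -1;

    toy_show(root, buf, root->max_attr_size + 1);
    toy_store(child, buf);
    free(buf);
    return 1;
}
```

Sizing the buffer from the stored maximum trades a small amount of bookkeeping at store time for not having to allocate a whole page on every propagation.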
    • slab: propagate tunable values · 943a451a
      Glauber Costa committed
      SLAB allows us to tune a particular cache behavior with tunables.  When
      creating a new memcg cache copy, we'd like to preserve any tunables the
      parent cache already had.
      
      This could be done by an explicit call to do_tune_cpucache() after the
      cache is created.  But this is not very convenient now that the caches are
      created from common code, since this function is SLAB-specific.
      
      Another method of doing that is taking advantage of the fact that
      do_tune_cpucache() is always called from enable_cpucache(), which is
      called at cache initialization.  We can just preset the values, and then
      things work as expected.
      
      It can also happen that a root cache has its tunables updated during
      normal system operation.  In this case, we will propagate the change to
      all caches that are already active.
      
      This change will require us to move the assignment of root_cache in
      memcg_params a bit earlier.  We need this to be already set - which
      memcg_kmem_register_cache will do - when we reach __kmem_cache_create().
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: aggregate memcg cache values in slabinfo · 749c5415
      Glauber Costa committed
      When we create caches in memcgs, we need to display their usage
      information somewhere.  We'll adopt a scheme similar to /proc/meminfo,
      with aggregate totals shown in the global file, and per-group information
      stored in the group itself.
      
      For the time being, only reads are allowed in the per-group cache.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg/sl[au]b: track all the memcg children of a kmem_cache · 7cf27982
      Glauber Costa committed
      This enables us to remove all the children of a kmem_cache being
      destroyed, if for example the kernel module it's being used in gets
      unloaded.  Otherwise, the children will still point to the destroyed
      parent.
      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: destroy memcg caches · 1f458cbf
      Glauber Costa committed
      Implement destruction of memcg caches.  Right now, only caches where our
      reference counter is the last remaining are deleted.  If there are any
      other reference counters around, we just leave the caches lying around
      until they go away.
      
      When that happens, a destruction function is called from the cache code.
      Caches are only destroyed in process context, so we queue them up for
      later processing in the general case.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sl[au]b: allocate objects from memcg cache · d79923fa
      Glauber Costa committed
      We are able to match a cache allocation to a particular memcg.  If the
      task doesn't change groups during the allocation itself (a rare event),
      this will give us a good picture of which group is the first to touch a
      cache page.
      
      This patch uses the now available infrastructure by calling
      memcg_kmem_get_cache() before all the cache allocations.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sl[au]b: always get the cache from its page in kmem_cache_free() · b9ce5ef4
      Glauber Costa committed
      struct page already has this information.  If we start chaining caches,
      this information will always be more trustworthy than whatever is passed
      into the function.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: skip memcg kmem allocations in specified code regions · 0e9d92f2
      Glauber Costa committed
      Create a mechanism that skips memcg allocations during certain pieces of
      our core code.  It works in basically the same way as
      preempt_disable()/preempt_enable(): by marking a region under which all
      allocations will be accounted to the root memcg.
      
      We need this to prevent races in early cache creation, when we
      allocate data using caches that are not necessarily created already.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
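The region marking described in the commit above behaves like a nesting counter, in the same spirit as preempt_disable()/preempt_enable(). The following is a small user-space sketch under that assumption; `struct toy_task` and the `toy_*` functions are hypothetical names chosen for the example and are not the kernel interface.

```c
#include <stdbool.h>

/* Hypothetical per-task state for the sketch. */
struct toy_task {
    int skip_account; /* nesting depth of "do not account" regions */
    bool in_memcg;    /* task belongs to a non-root group */
};

/* Enter a region in which allocations go to the root group. */
static void toy_stop_account(struct toy_task *t)
{
    t->skip_account++;
}

/* Leave the region; accounting resumes once the depth is back to zero. */
static void toy_resume_account(struct toy_task *t)
{
    t->skip_account--;
}

/* Decide whether an allocation should be charged to the task's group. */
static bool toy_should_account(const struct toy_task *t)
{
    return t->in_memcg && t->skip_account == 0;
}
```

Using a counter rather than a flag lets regions nest: an inner resume does not re-enable accounting while an outer region is still open.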
    • memcg: infrastructure to match an allocation to the right cache · d7f25f8a
      Glauber Costa committed
      The page allocator is able to bind a page to a memcg when it is
      allocated.  But for the caches, we'd like to have as many objects as
      possible in a page belonging to the same cache.
      
      This is done in this patch by calling memcg_kmem_get_cache in the
      beginning of every allocation function.  This function is patched out by
      static branches when kernel memory controller is not being used.
      
      It assumes that the task allocating, which determines the memcg in the
      page allocator, belongs to the same cgroup throughout the whole process.
      Misaccounting can happen if the task calls memcg_kmem_get_cache() while
      belonging to a cgroup, and later on changes.  This is considered
      acceptable, and should only happen upon task migration.
      
      Before the cache is created by the memcg core, there is also a possible
      imbalance: the task belongs to a memcg, but the cache being allocated from
      is the global cache, since the child cache is not yet guaranteed to be
      ready.  This case is also fine, since in this case the GFP_KMEMCG will not
      be passed and the page allocator will not attempt any cgroup accounting.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: allocate memory for memcg caches whenever a new memcg appears · 55007d84
      Glauber Costa committed
      Every cache that is considered a root cache (basically the "original"
      caches, tied to the root memcg/no-memcg) will have an array that should be
      large enough to store a cache pointer per each memcg in the system.
      
      Theoretically, this is as high as 1 << sizeof(css_id), which is currently
      in the 64k pointers range.  Most of the time, we won't be using that much.
      
      This patch introduces a simple scheme to dynamically allocate such
      an array, in order to minimize memory usage for memcg caches.  Because we
      would also like to avoid allocations all the time, at least for now, the
      array will only grow.  It will tend to be big enough to hold the maximum
      number of kmem-limited memcgs ever achieved.
      
      We'll allocate it to be a minimum of 64 kmem-limited memcgs.  When we have
      more than that, we'll start doubling the size of this array every time the
      limit is reached.
      
      Because we are only considering kmem limited memcgs, a natural point for
      this to happen is when we write to the limit.  At that point, we already
      have set_limit_mutex held, so that will become our natural synchronization
      mechanism.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
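The growth policy described in the commit above (start at 64 entries, double when full, never shrink) can be sketched as a small grow-only array. This is an illustration only: `struct toy_cache_array`, `toy_array_reserve` and `TOY_MIN_SLOTS` are hypothetical names, and the sketch leaves out the locking that the message attributes to set_limit_mutex.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_MIN_SLOTS 64 /* minimum size named in the commit message */

/* Hypothetical per-root-cache array of child cache pointers. */
struct toy_cache_array {
    void **slots;
    size_t capacity;
};

/*
 * Make sure the array can hold at least `needed` entries.
 * The array only grows: it starts at TOY_MIN_SLOTS and doubles until it
 * is large enough. Returns 0 on success, -1 if allocation fails.
 */
static int toy_array_reserve(struct toy_cache_array *a, size_t needed)
{
    size_t new_cap = a->capacity ? a->capacity : TOY_MIN_SLOTS;
    void **new_slots;

    if (needed <= a->capacity)
        return 0; /* already big enough; never shrink */

    while (new_cap < needed)
        new_cap *= 2;

    new_slots = calloc(new_cap, sizeof(*new_slots));
    if (!new_slots)
        return -1;

    /* Carry over the existing entries; new entries start out NULL. */
    if (a->slots)
        memcpy(new_slots, a->slots, a->capacity * sizeof(*new_slots));
    free(a->slots);

    a->slots = new_slots;
    a->capacity = new_cap;
    return 0;
}
```

Because the array never shrinks, its size settles at whatever the peak number of groups required, which matches the "big enough to hold the maximum number ever achieved" behaviour the message describes.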
    • slab/slub: consider a memcg parameter in kmem_create_cache · 2633d7a0
      Glauber Costa committed
      Allow a memcg parameter to be passed during cache creation.  When the slub
      allocator is being used, it will only merge caches that belong to the same
      memcg.  We'll do this by scanning the global list, and then translating
      the cache to a memcg-specific cache.
      
      A default function is created as a wrapper, passing NULL to the memcg
      version.
      
      A helper, memcg_css_id, is provided because slub needs a unique cache name
      for sysfs.  Since this is visible, but not the canonical location for slab
      data, the cache name is not used; the css_id should suffice.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab/slub: struct memcg_params · ba6c496e
      Glauber Costa committed
      For the kmem slab controller, we need to record some extra information in
      the kmem_cache structure.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs · 2ad306b1
      Glauber Costa committed
      Because those architectures will draw their stacks directly from the page
      allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
      flag, and issue the corresponding free_pages.
      
      This code path is taken when the architecture doesn't define
      CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
      THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
      architectures fall in this category.
      
      This guarantees that every stack page is accounted to the memcg the
      process currently lives in, and that allocations fail if they go over
      the limit.
      
      For the time being, I am defining a new variant of THREADINFO_GFP, not to
      mess with the other path.  Once the slab is also tracked by memcg, we can
      get rid of that flag.
      
      Tested to successfully protect against :(){ :|:& };:
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: use static branches when code not in use · a8964b9b
      Glauber Costa committed
      We can use static branches to patch the code in or out when not used.
      
      Because the _ACTIVE bit on kmem_accounted is only set after the increment
      is done, we guarantee that the root memcg will always be selected for kmem
      charges until all call sites are patched (see memcg_kmem_enabled).  This
      guarantees that no mischarges are applied.
      
      Static branch decrement happens when the last reference count from the
      kmem accounting in memcg dies.  This will only happen when the charges
      drop down to 0.
      
      When that happens, we need to disable the static branch only on those
      memcgs that enabled it.  To achieve this, we would be forced to complicate
      the code by keeping track of which memcgs were the ones that actually
      enabled limits, and which ones got it from its parents.
      
      It is a lot simpler just to do static_key_slow_inc() on every child
      that is accounted.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • res_counter: return amount of charges after res_counter_uncharge() · 50bdd430
      Glauber Costa committed
      It is useful to know how many charges are still left after a call to
      res_counter_uncharge.  While it is possible to issue a res_counter_read
      after uncharge, this can be racy.
      
      If we need, for instance, to take some action when the counters drop down
      to 0, only one of the callers should see it.  This is the same semantics
      as the atomic variables in the kernel.
      
      Since the current return value is void, we don't need to worry about
      anything breaking due to this change: nobody relied on that, and only
      users appearing from now on will be checking this value.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
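The race the commit above avoids can be illustrated with C11 atomics. This sketch swaps in `atomic_fetch_sub` as a stand-in for the idea and is not how res_counter itself is implemented; `struct toy_counter` and `toy_uncharge` are hypothetical names. Returning the remaining value from the same atomic step that performs the subtraction means only one caller can ever observe the drop to 0.

```c
#include <stdatomic.h>

/* Hypothetical counter used only for this illustration. */
struct toy_counter {
    atomic_ulong usage;
};

/*
 * Subtract `val` from the counter and return what is left.
 * atomic_fetch_sub returns the value held before the subtraction, so the
 * remainder is computed from the same atomic operation instead of from a
 * separate (and therefore racy) read afterwards.
 */
static unsigned long toy_uncharge(struct toy_counter *c, unsigned long val)
{
    return atomic_fetch_sub(&c->usage, val) - val;
}
```

A caller that needs to act when the counter reaches 0 would test the return value of `toy_uncharge` directly rather than reading the counter again.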
    • mm: allocate kernel pages to the right memcg · 6a1a0d3b
      Glauber Costa committed
      When a process tries to allocate a page with the __GFP_KMEMCG flag, the
      page allocator will call the corresponding memcg functions to validate
      the allocation.  Tasks in the root memcg can always proceed.
      
      To avoid adding markers to the page (and the kmem flag that would
      necessarily follow), as well as doing page_cgroup lookups for no reason,
      whoever marks its allocations with the __GFP_KMEMCG flag is responsible
      for telling the page allocator that this is such an allocation at
      free_pages() time.  This is done by invoking
      __free_accounted_pages() and free_accounted_pages().
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: kmem controller infrastructure · 7ae1e1d0
      Glauber Costa committed
      Introduce infrastructure for tracking kernel memory pages to a given
      memcg.  This will happen whenever the caller includes the __GFP_KMEMCG
      flag and the task belongs to a memcg other than the root.
      
      In memcontrol.h those functions are wrapped in inline accessors.  The idea
      is to patch those with static branches later on, so we don't incur any
      overhead when no mem cgroups with limited kmem are being used.
      
      Users of this functionality shall interact with the memcg core code
      through the following functions:
      
      memcg_kmem_newpage_charge: will return true if the group can handle the
                                 allocation. At this point, struct page is not
                                 yet allocated.
      
      memcg_kmem_commit_charge: will either revert the charge, if struct page
                                allocation failed, or embed memcg information
                                into page_cgroup.
      
      memcg_kmem_uncharge_page: called at free time, will revert the charge.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
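The three-step protocol listed in the commit above (charge before the page exists, commit or revert once the allocation result is known, uncharge at free time) can be modelled with a toy accounting structure. The names below (`toy_group`, `toy_page`, `toy_newpage_charge`, `toy_commit_charge`, `toy_uncharge_page`) are hypothetical stand-ins chosen to mirror the three functions; the signatures and the simple usage/limit counters are assumptions, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: a group with a page limit, and a page that remembers its owner. */
struct toy_group {
    unsigned long usage;
    unsigned long limit;
};

struct toy_page {
    struct toy_group *owner;
};

/*
 * Step 1: charge before the page exists.
 * A NULL group stands for the root group, which can always proceed.
 */
static bool toy_newpage_charge(struct toy_group *g, unsigned long nr_pages)
{
    if (!g)
        return true;
    if (g->usage + nr_pages > g->limit)
        return false;
    g->usage += nr_pages;
    return true;
}

/*
 * Step 2: after the allocation attempt. A NULL page means the allocation
 * failed, so the charge is reverted; otherwise the owner is recorded.
 */
static void toy_commit_charge(struct toy_page *page, struct toy_group *g,
                              unsigned long nr_pages)
{
    if (!g)
        return;
    if (!page) {
        g->usage -= nr_pages;
        return;
    }
    page->owner = g;
}

/* Step 3: at free time, give the charge back to the recorded owner. */
static void toy_uncharge_page(struct toy_page *page, unsigned long nr_pages)
{
    if (!page->owner)
        return;
    page->owner->usage -= nr_pages;
    page->owner = NULL;
}
```

Splitting the charge from the commit lets the limit check happen before any page is allocated, while still covering the case where the allocation itself fails afterwards.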
    • mm: add a __GFP_KMEMCG flag · 7a64bf05
      Glauber Costa committed
      This flag is used to indicate to the callees that this allocation is a
      kernel allocation in process context, and should be accounted to current's
      memcg.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a64bf05
  2. 18 Dec, 2012 19 commits
    • C
      fs, exportfs: add exportfs_encode_inode_fh() helper · 711c7bf9
      Cyrill Gorcunov committed
      We will need this helper in the next patch to provide a file handle for
      inotify marks in /proc/pid/fdinfo output.
      
      Rather, the patch provides a way to use inodes directly when a dentry
      is not available (as in the case of the inotify system).
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: James Bottomley <jbottomley@parallels.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Helsley <matt.helsley@gmail.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      711c7bf9
    • C
      fs, epoll: add procfs fdinfo helper · 138d22b5
      Cyrill Gorcunov committed
      This allows us to print out the eventpoll target file descriptor, events
      and data.  The /proc/pid/fdinfo/fd output consists of
      
       | pos:	0
       | flags:	02
       | tfd:        5 events:       1d data: ffffffffffffffff enabled: 1
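
A reader of this output could parse the `tfd:` line roughly as sketched below. The field layout is copied from the sample above and may differ on a given kernel; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one "tfd:" line of the sample output. */
struct epoll_fdinfo_entry {
	int tfd;			/* target file descriptor */
	unsigned int events;		/* event mask, printed in hex */
	unsigned long long data;	/* user data, printed in hex */
	int enabled;
};

/*
 * Parse one line in the sample's layout.  Returns 0 on success and -1
 * if the line does not match (for example the "pos:" or "flags:" lines).
 */
static int parse_epoll_fdinfo_line(const char *line,
				   struct epoll_fdinfo_entry *out)
{
	int n;

	assert(line && out);
	n = sscanf(line, "tfd: %d events: %x data: %llx enabled: %d",
		   &out->tfd, &out->events, &out->data, &out->enabled);
	return n == 4 ? 0 : -1;
}
```

Whitespace in the `sscanf` format matches any run of blanks, so the column padding in the output does not need special handling.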
      
      [avagin@: fix for uninitialized ret variable]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: James Bottomley <jbottomley@parallels.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Helsley <matt.helsley@gmail.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      138d22b5
    • C
      procfs: add ability to plug in auxiliary fdinfo providers · 55985dd7
      Cyrill Gorcunov committed
      This patch adds the ability to print out auxiliary data associated with
      a file in the procfs interface /proc/pid/fdinfo/fd.
      
      In particular, further patches make eventfd, eventpoll, signalfd and
      fsnotify print additional information complete enough to restore
      these objects after checkpoint.
      
      To simplify the code we add a show_fdinfo callback inside struct
      file_operations (as Al and Pavel are proposing).
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: James Bottomley <jbottomley@parallels.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Helsley <matt.helsley@gmail.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55985dd7
    • A
      prandom: introduce prandom_bytes() and prandom_bytes_state() · 6582c665
      Akinobu Mita committed
      Add functions to get the requested number of pseudo-random bytes.
      
      The difference from get_random_bytes() is that it generates pseudo-random
      numbers by prandom_u32().  It doesn't consume the entropy pool, and the
      sequence is reproducible if the same rnd_state is used.  So it is suitable
      for generating random bytes for testing.
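
The reproducibility property can be illustrated with a small user-space sketch. The generator here is a plain xorshift32 chosen only for brevity; it is an assumption and not the kernel's algorithm, and all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rnd_state. */
struct toy_rnd_state { uint32_t s; };

static void toy_seed(struct toy_rnd_state *st, uint32_t seed)
{
	st->s = seed ? seed : 1u;	/* xorshift must not start at zero */
}

/* One xorshift32 step (not the kernel's generator). */
static uint32_t toy_u32_state(struct toy_rnd_state *st)
{
	uint32_t x = st->s;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	st->s = x;
	return x;
}

/* Fill buf with n pseudo-random bytes, one 32-bit word at a time. */
static void toy_bytes_state(struct toy_rnd_state *st, void *buf, size_t n)
{
	unsigned char *p = buf;

	assert(st && (buf || n == 0));
	while (n > 0) {
		uint32_t r = toy_u32_state(st);
		size_t chunk = n < sizeof(r) ? n : sizeof(r);

		memcpy(p, &r, chunk);
		p += chunk;
		n -= chunk;
	}
}
```

Two states seeded with the same value yield identical byte streams, which is what makes this style of generator convenient for tests that must be repeatable.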
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Eilon Greenstein <eilong@broadcom.com>
      Cc: David Laight <david.laight@aculab.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Robert Love <robert.w.love@intel.com>
      Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6582c665
    • A
      random32: rename random32 to prandom · 496f2f93
      Akinobu Mita committed
      This renames all random32 functions to have 'prandom_' prefix as follows:
      
        void prandom_seed(u32 seed);	/* rename from srandom32() */
        u32 prandom_u32(void);		/* rename from random32() */
        void prandom_seed_state(struct rnd_state *state, u64 seed);
        				/* rename from prandom32_seed() */
        u32 prandom_u32_state(struct rnd_state *state);
        				/* rename from prandom32() */
      
      The purpose of this renaming is to prevent kernel developers from
      assuming, because of the "p" in prandom32(), that only prandom32()
      uses a pseudo-random number generator and random32() does not.  Such
      a misunderstanding could result in a very embarrassing security
      exposure.  This concern was expressed by Theodore Ts'o.
      
      Furthermore, I'm going to introduce new functions for getting the
      requested number of pseudo-random bytes.  If I continued to use both
      the prandom32 and random32 prefixes for these functions, the confusion
      would get worse.
      
      As a result of this renaming, "prandom_" is the common prefix for the
      pseudo-random number library.
      
      Currently, srandom32() and random32() are preserved because it is
      difficult to rename too many users at once.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Robert Love <robert.w.love@intel.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Cc: David Laight <david.laight@aculab.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Eilon Greenstein <eilong@broadcom.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      496f2f93
    • J
      linux/compiler.h: add __must_hold macro for functions called with a lock held · 8529091e
      Josh Triplett committed
      linux/compiler.h has macros to denote functions that acquire or release
      locks, but not to denote functions called with a lock held that return
      with the lock still held.  Add a __must_hold macro to cover that case.
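
A sketch of how such an annotation might be used follows. The sparse `context` attribute spelling is an assumption about how the macro is defined, and the toy lock and counter types are hypothetical; outside a checker run the macro expands to nothing, so the code builds with an ordinary compiler.

```c
#include <assert.h>

/* Assumed spelling: only a static checker (sparse) sees the attribute. */
#ifdef __CHECKER__
# define __must_hold(x)	__attribute__((context(x, 1, 1)))
#else
# define __must_hold(x)
#endif

/* Hypothetical single-threaded toy lock, used only to show call order. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

struct counter { struct toy_lock lock; int value; };

/* Entered with c->lock held and returns with it still held. */
static void counter_bump_locked(struct counter *c) __must_hold(&c->lock)
{
	assert(c->lock.held);
	c->value++;
}

/* Public entry point: takes the lock, calls the helper, drops the lock. */
static int counter_bump(struct counter *c)
{
	int v;

	toy_lock_acquire(&c->lock);
	counter_bump_locked(c);
	v = c->value;
	toy_lock_release(&c->lock);
	return v;
}
```

The annotation documents the locking contract at the declaration, where a checker can compare it against each call site.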
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Reported-by: Ed Cashin <ecashin@coraid.com>
      Tested-by: Ed Cashin <ecashin@coraid.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8529091e
    • G
      pidns: remove unused is_container_init() · a5ba911e
      Gao feng committed
      Since commit 1cdcbec1 ("CRED: Neuter sys_capset()")
      is_container_init() has no callers.
      Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5ba911e
    • K
      exec: use -ELOOP for max recursion depth · d7402698
      Kees Cook committed
      To avoid an explosion of request_module calls on a chain of abusive
      scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
      as maximum recursion depth is hit, the error will fail all the way back
      up the chain, aborting immediately.
      
      This also has the side-effect of stopping the user's shell from attempting
      to reexecute the top-level file as a shell script. As seen in the
      dash source:
      
              if (cmd != path_bshell && errno == ENOEXEC) {
                      *argv-- = cmd;
                      *argv = cmd = path_bshell;
                      goto repeat;
              }
      
      The above logic was designed for running scripts automatically that lacked
      the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
      things continue to behave as the shell expects.
      
      Additionally, when tracking recursion, the binfmt handlers should not be
      involved. The recursion being tracked is the depth of calls through
      search_binary_handler(), so that function should be exclusively responsible
      for tracking the depth.
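
The error propagation described above can be modelled in user space as sketched here. The file model, the `toy_*` names and the depth limit of 5 are assumptions for illustration; only the choice of -ELOOP versus -ENOEXEC comes from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_DEPTH 5		/* hypothetical recursion limit */

/* Hypothetical file: either a real binary or a script with an interpreter. */
struct toy_file {
	const struct toy_file *interp;	/* NULL if no "#!" line */
	int is_binary;
};

/*
 * Only this function tracks the depth.  Hitting the limit yields
 * -ELOOP, which every caller up the chain returns unchanged.
 */
static int toy_search_handler(const struct toy_file *f, unsigned int depth)
{
	assert(f);
	if (depth > TOY_MAX_DEPTH)
		return -ELOOP;
	if (f->is_binary)
		return 0;
	if (!f->interp)
		return -ENOEXEC;	/* legitimate "no such format" */
	return toy_search_handler(f->interp, depth + 1);
}

static int toy_exec(const struct toy_file *f)
{
	return toy_search_handler(f, 0);
}
```

Because the shell only retries on ENOEXEC, a chain that fails with ELOOP is not re-run as a shell script.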
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: halfdog <me@halfdog.net>
      Cc: P J P <ppandit@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7402698
    • O
      ptrace: introduce PTRACE_O_EXITKILL · 992fb6e1
      Oleg Nesterov committed
      Ptrace jailers want to be sure that the tracee can never escape
      from the control. However if the tracer dies unexpectedly the
      tracee continues to run in potentially unsafe mode.
      
      Add the new ptrace option PTRACE_O_EXITKILL. If the tracer exits
      it sends SIGKILL to every tracee which has this bit set.
      
      Note that the new option is not equal to the last-option << 1, because
      currently all options have an event, and the new one starts the eventless
      group.  It uses the arbitrarily chosen bit 20, so we have room for 12 more
      events, but we can also add new eventless options below this one.
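
The bit arithmetic in the note above can be checked with a short sketch. The constants are local copies with hypothetical `TOY_` names; the assumption that event-backed options occupied bits 0 through 7 at the time is inferred from the "12 more events" figure.

```c
#include <assert.h>

/* Assumed: events 0..7 each own the option bit (1 << event). */
#define TOY_HIGHEST_EVENT	7
#define TOY_EXITKILL_BIT	20
#define TOY_O_EXITKILL		(1u << TOY_EXITKILL_BIT)

/* Option flag that corresponds to an event number. */
static unsigned int toy_event_option(unsigned int event)
{
	assert(event < TOY_EXITKILL_BIT);
	return 1u << event;
}

/* Bits left for future event-backed options below the eventless group. */
static unsigned int toy_free_event_bits(void)
{
	return TOY_EXITKILL_BIT - (TOY_HIGHEST_EVENT + 1);
}
```

With eight bits in use and the eventless group starting at bit 20, bits 8 through 19 remain, which gives the 12 free slots mentioned in the text.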
      
      Suggested by Amnon Shiloh.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Amnon Shiloh <u3557@miso.sublimeip.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Chris Evans <scarybeasts@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      992fb6e1
    • E
      kstrto*: add documentation · 4c925d60
      Eldad Zack committed
      As Bruce Fields pointed out, kstrto* is currently lacking kerneldoc
      comments.  This patch adds kerneldoc comments to common variants of
      kstrto*: kstrto(u)l, kstrto(u)ll and kstrto(u)int.
      Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Rob Landley <rob@landley.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c925d60
    • C
      compat: generic compat_sys_sched_rr_get_interval() implementation · 0ad50c38
      Catalin Marinas committed
      This function is used by sparc, powerpc, tile and arm64 for compat support.
      The patch adds a generic implementation with a wrapper for PowerPC to do
      the u32->int sign extension.
      
      The reason for a single patch covering powerpc, tile, sparc and arm64 is
      to keep it bisectable, otherwise kernel building may fail with mismatched
      function declarations.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>  [for tile]
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ad50c38
    • O
      percpu_rw_semaphore: add lockdep annotations · 8ebe3473
      Oleg Nesterov committed
      Add lockdep annotations.  Not only can this help to find potential
      problems, we also do not want false warnings if, say, the task takes two
      different percpu_rw_semaphores for reading.  IOW, at least ->rw_sem
      should not use a single class.
      
      This patch exposes this internal lock to lockdep so that it represents the
      whole percpu_rw_semaphore.  This way we do not need to add another "fake"
      ->lockdep_map and lock_class_key.  More importantly, this also makes the
      output from lockdep much more understandable if it finds the problem.
      
      In short, with this patch from lockdep pov percpu_down_read() and
      percpu_up_read() acquire/release ->rw_sem for reading, this matches the
      actual semantics.  This abuses __up_read() but I hope this is fine and in
      fact I'd like to have down_read_no_lockdep() as well,
      percpu_down_read_recursive_readers() will need it.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ebe3473
    • O
      percpu_rw_semaphore: kill ->writer_mutex, add ->write_ctr · 9390ef0c
      Oleg Nesterov committed
      percpu_rw_semaphore->writer_mutex was only added to simplify the initial
      rewrite, the only thing it protects is clear_fast_ctr() which otherwise
      could be called by multiple writers.  ->rw_sem is enough to serialize the
      writers.
      
      Kill this mutex and add "atomic_t write_ctr" instead.  The writers
      increment/decrement this counter, the readers check it is zero instead of
      mutex_is_locked().
      
      Move atomic_add(clear_fast_ctr(), slow_read_ctr) under down_write() to
      avoid the race with other writers.  This is a bit sub-optimal, only the
      first writer needs this and we do not need to exclude the readers at this
      stage.  But this is simple, we do not want another internal lock until we
      add more features.
      
      And this speeds up the write-contended case.  Before this patch the racing
      writers sleep in synchronize_sched_expedited() sequentially, with this
      patch multiple synchronize_sched_expedited's can "overlap" with each
      other.  Note: we can do more optimizations, this is only the first step.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9390ef0c
    • O
      percpu_rw_semaphore: reimplement to not block the readers unnecessarily · a1fd3e24
      Oleg Nesterov committed
      Currently the writer does msleep() plus synchronize_sched() 3 times to
      acquire/release the semaphore, and during this time the readers are
      blocked completely, even if the "write" section has not actually started
      or has already finished.
      
      With this patch down_write/up_write does synchronize_sched() twice and
      down_read/up_read are still possible during this time, just they use the
      slow path.
      
      percpu_down_write() first forces the readers to use rw_semaphore and
      increment the "slow" counter to take the lock for reading, then it
      takes that rw_semaphore for writing and blocks the readers.
      
      Also, with this patch the code relies on the documented behaviour of
      synchronize_sched(); it doesn't try to pair synchronize_sched() with
      a barrier.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1fd3e24
    • A
      string: introduce helper to get base file name from given path · b18888ab
      Andy Shevchenko committed
      There are several places in the kernel that use functionality like
      basename(3), with one exception: in the case of '/foo/bar/' we expect to
      get an empty string.  Let's add a common helper for them.
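
A minimal sketch of the behaviour described above follows. The function name is hypothetical and the body is one straightforward way to get these semantics, not necessarily the helper as merged.

```c
#include <assert.h>
#include <string.h>

/*
 * Return the part of path after the last '/'.  Unlike basename(3),
 * a trailing slash yields an empty string, and the input is never
 * modified.
 */
static const char *base_name_sketch(const char *path)
{
	const char *tail;

	assert(path);
	tail = strrchr(path, '/');
	return tail ? tail + 1 : path;
}
```

For '/foo/bar' this returns a pointer to "bar", for '/foo/bar/' a pointer to the terminating NUL, and for a path without any slash the path itself.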
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: YAMANE Toshiaki <yamanetoshi@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b18888ab
    • T
      backlight: add of_find_backlight_by_node() · 762a936f
      Thierry Reding committed
      This function finds the struct backlight_device for a given device tree
      node.  A dummy function is provided so that it safely compiles out if OF
      support is disabled.
      
      [akpm@linux-foundation.org: Don't use IS_ENABLED(CONFIG_OF)]
      Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
      Acked-by: Jingoo Han <jg1.han@samsung.com>
      Reviewed-by: Grant Likely <grant.likely@secretlab.ca>
      Cc: Thierry Reding <thierry.reding@avionic-design.de>
      Reviewed-by: Grant Likely <grant.likely@secretlab.ca>
      Acked-by: Jingoo Han <jg1.han@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      762a936f
    • K
      drivers/video/backlight/lp855x_bl.c: use generic PWM functions · 8cc9764c
      Kim, Milo committed
      The LP855x family devices support the PWM input for the backlight control.
      The period of the PWM is configurable on the platform side.  Platform
      specific functions are no longer necessary because generic PWM functions
      are used inside the driver.
      
      (PWM input mode)
      To set the brightness, new lp855x_pwm_ctrl() is used.
      If a PWM device is not allocated, devm_pwm_get() is called.
      The PWM consumer name is from the chip name such as 'lp8550' and 'lp8556'.
      To get the brightness value, no additional handling is required.
      Just the value of 'props.brightness' is returned.
      
      If the PWM driver is not ready while initializing the LP855x driver, it's
      OK.  The PWM device can be retrieved later, when the brightness value is
      changed.
      
      Documentation is updated with an example.
      
      [akpm@linux-foundation.org: coding-style simplification, per Thierry]
      Signed-off-by: Milo(Woogyom) Kim <milo.kim@ti.com>
      Cc: Thierry Reding <thierry.reding@avionic-design.de>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Bryan Wu <bryan.wu@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cc9764c
    • A
      lseek: the "whence" argument is called "whence" · 965c8e59
      Andrew Morton committed
      But the kernel decided to call it "origin" instead.  Fix most of the
      sites.
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      965c8e59
    • M
      include/linux/init.h: use the stringify operator for the __define_initcall macro · 7929d407
      Matthew Leach committed
      Currently the __define_initcall() macro takes three arguments, fn, id and
      level.  The level argument is exactly the same as the id argument but
      wrapped in quotes.  To overcome this need to specify three arguments to
      the __define_initcall macro, where one argument is the stringification of
      another, we can just use the stringification macro instead.
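
The idea can be shown with two small macros. The `TOY_` names and the ".initcall<id>.init" string shape are assumptions used for illustration; the point is only that the preprocessor's `#` operator makes the quoted copy of the argument redundant.

```c
#include <assert.h>
#include <string.h>

/* Before: the caller passes the id twice, once quoted as "level". */
#define TOY_SECTION_OLD(level, id)	".initcall" level ".init"

/* After: "#id" turns the single argument into a string literal. */
#define TOY_SECTION_NEW(id)		".initcall" #id ".init"

/* Helper that compares the two spellings for the same id. */
static int toy_sections_match(void)
{
	return strcmp(TOY_SECTION_OLD("6", 6), TOY_SECTION_NEW(6)) == 0;
}
```

Adjacent string literals are concatenated by the compiler, so both macros expand to a single constant string and the new form cannot drift out of sync with its id.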
      Signed-off-by: Matthew Leach <matthew@mattleach.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7929d407