1. 18 June 2013, 1 commit
    • cgroup: disallow rename(2) if sane_behavior · 6db8e85c
      Tejun Heo authored
      cgroup's rename(2) isn't a proper migration implementation - it can't
      move the cgroup to a different parent in the hierarchy.  All it can do
      is swap the name string for that cgroup.  This isn't useful and can
      mislead users into thinking that cgroup supports proper cgroup-level
      migration.  Disallow rename(2) if sane_behavior.
      
      v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs
          return value when ->rename is not implemented as suggested by Li.
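      A minimal sketch of the kind of check this implies in cgroup's rename
      handler (the function shape and the __d_cgrp()/cgroup_sane_behavior()
      helpers are assumptions for illustration, not the actual diff):

      	/* sketch: refuse rename(2) under sane_behavior */
      	static int cgroup_rename(struct inode *old_dir, struct dentry *old_dentry,
      				 struct inode *new_dir, struct dentry *new_dentry)
      	{
      		struct cgroup *cgrp = __d_cgrp(old_dentry);

      		/*
      		 * rename(2) isn't a real migration; under sane_behavior fail
      		 * with -EPERM, matching what the VFS returns when ->rename
      		 * is not implemented.
      		 */
      		if (cgroup_sane_behavior(cgrp))
      			return -EPERM;

      		/* ... existing name-swap logic continues here ... */
      		return 0;
      	}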
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      6db8e85c
  2. 14 June 2013, 10 commits
    • cgroup: use percpu refcnt for cgroup_subsys_states · d3daf28d
      Tejun Heo authored
      A css (cgroup_subsys_state) is how each cgroup is represented to a
      controller.  As such, it can be used in hot paths across the various
      subsystems different controllers are associated with.
      
      One of the common operations is reference counting, which up until now
      has been implemented using a global atomic counter and can have
      significant adverse impact on scalability.  For example, css refcnt
      can be gotten and put multiple times by blkcg for each IO request.
      For high-IOPS configurations which try to do as much per-cpu as
      possible, this frequent global refcnting can be very expensive.
      
      In general, given the various and hugely diverse paths css's end up
      being used from, we need to make it cheap and highly scalable.  In its
      usage, css refcnting isn't very different from module refcnting.
      
      This patch converts css refcnting to use the recently added
      percpu_ref.  css_get/tryget/put() directly maps to the matching
      percpu_ref operations and the deactivation logic is no longer
      necessary as percpu_ref already has refcnt killing.
      
      The only complication is that as the refcnt is per-cpu,
      percpu_ref_kill() in itself doesn't ensure that further tryget
      operations will fail, which we need to guarantee before invoking
      ->css_offline()'s.  This is resolved by collecting kill confirmation
      using percpu_ref_kill_and_confirm() and initiating the offline phase
      of destruction only after all css refcnts are confirmed to be seen as
      killed on all CPUs.  The previous patches already split destruction
      into two phases, so percpu_ref_kill_and_confirm() can be hooked up
      easily.
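      As a rough sketch of that shape (names like css->destroy_work are
      assumed for illustration; this is not the actual diff), the kill side
      boils down to:

      	#include <linux/cgroup.h>
      	#include <linux/percpu-refcount.h>
      	#include <linux/workqueue.h>

      	/* called once all CPUs are guaranteed to see the ref as dead */
      	static void css_ref_killed_fn(struct percpu_ref *ref)
      	{
      		struct cgroup_subsys_state *css =
      			container_of(ref, struct cgroup_subsys_state, refcnt);

      		/* tryget can no longer succeed; safe to start ->css_offline() */
      		schedule_work(&css->destroy_work);
      	}

      	static void kill_css_ref(struct cgroup_subsys_state *css)
      	{
      		/* mark the percpu ref dead and ask every CPU to confirm it */
      		percpu_ref_kill_and_confirm(&css->refcnt, css_ref_killed_fn);
      	}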
      
      This patch removes css_refcnt() which is used for rcu dereference
      sanity check in css_id().  While we can add a percpu refcnt API to ask
      the same question, css_id() itself is scheduled to be removed fairly
      soon, so let's not bother with it.  Just drop the sanity check and use
      rcu_dereference_raw() instead.
      
      v2: - init_cgroup_css() was calling percpu_ref_init() without checking
            the return value.  This causes two problems - the obvious lack
            of error handling and percpu_ref_init() being called from
            cgroup_init_subsys() before the allocators are up, which
            triggers warnings but doesn't cause actual problems as the
            refcnt isn't used for roots anyway.  Fix both by moving
            percpu_ref_init() to cgroup_create().
      
          - The base references were put too early by
            percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
            refs one extra time.  This wasn't noticeable because css's go
            through another RCU grace period before being freed.  Update
            cgroup_destroy_locked() to grab an extra reference before
            killing the refcnts.  This problem was noticed by Kent.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Kent Overstreet <koverstreet@google.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Alasdair G. Kergon" <agk@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      d3daf28d
    • cgroup: split cgroup destruction into two steps · ea15f8cc
      Tejun Heo authored
      Split cgroup_destroy_locked() into two steps and put the latter half
      into cgroup_offline_fn() which is executed from a work item.  The
      latter half is responsible for offlining the css's, removing the
      cgroup from internal lists, and propagating release notification to
      the parent.  The separation is to allow using percpu refcnt for css.
      
      Note that this allows for other cgroup operations to happen between
      the first and second halves of destruction, including creating a new
      cgroup with the same name.  As the target cgroup is marked DEAD in the
      first half and cgroup internals don't care about the names of cgroups,
      this should be fine.  A comment explaining this will be added by the
      next patch which implements the actual percpu refcnting.
      
      As RCU freeing is guaranteed to happen after the second step of
      destruction, we can use the same work item for both.  This patch
      renames cgroup->free_work to ->destroy_work and uses it for both
      purposes.  INIT_WORK() is now performed right before queueing the work
      item.
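      A sketch of the resulting two-step shape (abridged; cgroup_offline_fn()
      and ->destroy_work are from the description above, the rest is assumed
      for illustration, not the actual diff):

      	static void cgroup_offline_fn(struct work_struct *work);

      	static int cgroup_destroy_locked(struct cgroup *cgrp)
      	{
      		/* first half: mark the cgroup dead and hide it from userland */
      		set_bit(CGRP_DEAD, &cgrp->flags);

      		/* second half runs later: offline css's, unlink, notify parent */
      		INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
      		schedule_work(&cgrp->destroy_work);
      		return 0;
      	}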
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      ea15f8cc
    • cgroup: reorder the operations in cgroup_destroy_locked() · 455050d2
      Tejun Heo authored
      This patch reorders the operations in cgroup_destroy_locked() such
      that the userland visible parts happen before css offlining and
      removal from the ->sibling list.  This will be used to make css use
      percpu refcnt.
      
      While at it, split out CGRP_DEAD related comment from the refcnt
      deactivation one and correct / clarify how different guarantees are
      met.
      
      While this patch changes the specific order of operations, it
      shouldn't cause any noticeable behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      455050d2
    • cgroup: remove cgroup->count and use · 6f3d828f
      Tejun Heo authored
      cgroup->count tracks the number of css_sets associated with the cgroup
      and is used only to verify that no css_set is associated when the
      cgroup is being destroyed.  It's superfluous as the destruction path
      can simply check whether cgroup->cset_links is empty instead.
      
      Drop cgroup->count and check ->cset_links directly from
      cgroup_destroy_locked().
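      A sketch of what that check amounts to (the helper name is assumed for
      illustration; not the actual diff):

      	/* does any css_set still reference @cgrp? */
      	static bool cgroup_has_css_sets(struct cgroup *cgrp)
      	{
      		bool busy;

      		read_lock(&css_set_lock);
      		busy = !list_empty(&cgrp->cset_links);
      		read_unlock(&css_set_lock);

      		return busy;
      	}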
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      6f3d828f
    • cgroup: drop unnecessary RCU dancing from __put_css_set() · ddd69148
      Tejun Heo authored
      __put_css_set() does RCU read access on @cgrp across dropping
      @cgrp->count so that it can continue accessing @cgrp even if the count
      reached zero and destruction of the cgroup commenced.  Given that both
      sides - __css_put() and cgroup_destroy_locked() - are cold paths, this
      is unnecessary.  Just making cgroup_destroy_locked() grab css_set_lock
      while checking @cgrp->count is enough.
      
      Remove the RCU read locking from __put_css_set() and make
      cgroup_destroy_locked() read-lock css_set_lock when checking
      @cgrp->count.  This will also allow removing @cgrp->count.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      ddd69148
    • cgroup: rename CGRP_REMOVED to CGRP_DEAD · 54766d4a
      Tejun Heo authored
      We will add another flag indicating that the cgroup is in the process
      of being killed.  REMOVING / REMOVED is more difficult to distinguish
      and cgroup_is_removing()/cgroup_is_removed() are a bit awkward.  Also,
      later percpu_ref usage will involve "kill"ing the refcnt.
      
       s/CGRP_REMOVED/CGRP_DEAD/
       s/cgroup_is_removed()/cgroup_is_dead()/
      
      This patch is purely cosmetic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      54766d4a
    • cgroup: use kzalloc() instead of kmalloc() · f4f4be2b
      Tejun Heo authored
      There's no point in using kmalloc() instead of the clearing variant
      for trivial stuff.  We can live dangerously elsewhere.  Use kzalloc()
      instead and drop 0 inits.
      
      While at it, do trivial code reorganization in cgroup_file_open().
      
      This patch doesn't introduce any functional changes.
      
      v2: I was caught in the very distant past where list_del() didn't
          poison and the initial version converted list_del()s to
          list_del_init()s too.  Li and Kent took me out of the stasis
          chamber.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <koverstreet@google.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      f4f4be2b
    • cgroup: bring some sanity to naming around cg_cgroup_link · 69d0206c
      Tejun Heo authored
      cgroups and css_sets are mapped M:N and this M:N mapping is
      represented by struct cg_cgroup_link which forms linked lists on both
      sides.  The naming around this mapping is already confusing and struct
      cg_cgroup_link exacerbates the situation quite a bit.
      
      From the cgroup side, it starts off ->css_sets and runs through
      ->cgrp_link_list.  From the css_set side, it starts off ->cg_links and
      runs through ->cg_link_list.  This is rather backwards, as
      cgrp_link_list is used to iterate css_sets and cg_link_list to iterate
      cgroups.  Also, this is the only place which is still using the
      confusing "cg" for css_sets.  This patch cleans it up a bit.
      
      * s/cgroup->css_sets/cgroup->cset_links/
        s/css_set->cg_links/css_set->cgrp_links/
        s/cgroup_iter->cg_link/cgroup_iter->cset_link/
      
      * s/cg_cgroup_link/cgrp_cset_link/
      
      * s/cgrp_cset_link->cg/cgrp_cset_link->cset/
        s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
        s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/
      
      * s/init_css_set_link/init_cgrp_cset_link/
        s/free_cg_links/free_cgrp_cset_links/
        s/allocate_cg_links/allocate_cgrp_cset_links/
      
      * s/cgl[12]/link[12]/ in compare_css_sets()
      
      * s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple of similar
        adjustments.

      * Comment and whitespace adjustments.
      
      After the changes, we have
      
      	list_for_each_entry(link, &cont->cset_links, cset_link) {
      		struct css_set *cset = link->cset;
      
      instead of
      
      	list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
      		struct css_set *cset = link->cg;
      
      This patch is purely cosmetic.
      
      v2: Fix broken sentences in the patch description.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      69d0206c
    • cgroup: consistently use @cset for struct css_set variables · 5abb8855
      Tejun Heo authored
      cgroup.c uses @cg for most struct css_set variables, which in itself
      could be a bit confusing, but made much worse by the fact that there
      are places which use @cg for struct cgroup variables.
      compare_css_sets() epitomizes this confusion - @[old_]cg are struct
      css_set while @cg[12] are struct cgroup.
      
      It's not like the whole deal with cgroup, css_set and cg_cgroup_link
      isn't already confusing enough.  Let's give it some sanity by
      uniformly using @cset for all struct css_set variables.
      
      * s/cg/cset/ for all css_set variables.
      
      * s/oldcg/old_cset/ s/oldcgrp/old_cgrp/.  The same for the ones
        prefixed with "new".
      
      * s/cg/cgrp/ for cgroup variables in compare_css_sets().
      
      * s/css/cset/ for the cgroup variable in task_cgroup_from_root().
      
      * Whitespace adjustments.
      
      This patch is purely cosmetic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      5abb8855
    • cgroup: remove now unused css_depth() · 3fc3db9a
      Tejun Heo authored
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3fc3db9a
  3. 06 June 2013, 3 commits
    • cgroup: clean up the cftype array for the base cgroup files · d5c56ced
      Tejun Heo authored
      * Rename it from files[] (really?) to cgroup_base_files[].
      
      * Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and
        used inconsistently.  Just use "cgroup." directly.
      
      * Collect insane files at the end.  Note that only the insane ones are
        missing the "cgroup." prefix.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      d5c56ced
    • cgroup: mark "notify_on_release" and "release_agent" cgroup files insane · cc5943a7
      Tejun Heo authored
      The empty cgroup notification mechanism currently implemented in
      cgroup is tragically outdated.  Forking and execing a userland process
      stopped being a viable notification mechanism more than a decade ago.
      We're gonna have a saner mechanism.  Let's make it clear that this
      abomination is going away.
      
      Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.
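      For illustration, such an entry in the base cftype array looks roughly
      like this (handlers abridged; the struct fields are the existing cftype
      ones, the handler and array names are assumptions):

      	static struct cftype cgroup_base_files[] = {
      		{
      			.name = "notify_on_release",
      			.flags = CFTYPE_INSANE,	/* not created under sane_behavior */
      			.read_u64 = cgroup_read_notify_on_release,
      			.write_u64 = cgroup_write_notify_on_release,
      		},
      		/* ... */
      		{ }	/* terminate */
      	};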
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      cc5943a7
    • cgroup: mark "tasks" cgroup file as insane · f12dc020
      Tejun Heo authored
      Some resources controlled by cgroup aren't per-task, and cgroup core
      allowing threads of a single thread_group to be in different cgroups
      forced memcg to explicitly find the group leader and use it.  This is
      gonna be nasty when transitioning to unified hierarchy, and in general
      we don't want and won't support granularity finer than processes.
      
      Mark "tasks" with CFTYPE_INSANE.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: cgroups@vger.kernel.org
      Cc: Vivek Goyal <vgoyal@redhat.com>
      f12dc020
  4. 29 May 2013, 1 commit
    • cgroup: warn about mismatching options of a new mount of an existing hierarchy · 2a0ff3fb
      Jeff Liu authored
      With the introduction of the new __DEVEL__sane_behavior mount option,
      if the root cgroup is alive without xattr support, mounting a new
      cgroup hierarchy with xattr is rejected by design, which is fine.
      However, if the root cgroup was not mounted with __DEVEL__sane_behavior,
      creating a new cgroup mount with the xattr option succeeds, yet the EA
      functionality doesn't actually work as expected; setting attributes
      under either cgroup fails with ENOTSUPP. e.g.
      
      setfattr: /cgroup2/test: Operation not supported
      
      Instead of keeping silent in this case, it's better to emit a log
      entry at warning level.  That helps the user understand what is going
      on behind the scenes, and it is essentially an improvement that does
      not break backward compatibility.

      With this fix, the above mount attempt keeps working as usual, but the
      following line can be found in the system log:
      
      [ ...] cgroup: new mount options do not match the existing superblock
      
      tj: minor formatting / message updates.
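      A sketch of where the warning fits (the comparison helper name is an
      assumption; only the message text comes from the log line above):

      	/* an existing hierarchy is being reused with different options */
      	if (!mount_opts_match_root(root, &opts))
      		pr_warning("cgroup: new mount options do not match the existing superblock\n");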
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      2a0ff3fb
  5. 28 May 2013, 1 commit
  6. 25 May 2013, 1 commit
  7. 24 May 2013, 4 commits
    • cgroup: update iterators to use cgroup_next_sibling() · 75501a6d
      Tejun Heo authored
      This patch converts cgroup_for_each_child(),
      cgroup_next_descendant_pre/post() and thus
      cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
      instead of manually dereferencing ->sibling.next.
      
      The only reason the iterators couldn't allow dropping the RCU read
      lock while iteration is in progress was that they couldn't determine
      the next sibling safely once the RCU read lock is dropped.  Using
      cgroup_next_sibling() removes that problem and enables all iterators
      to allow dropping the RCU read lock mid-iteration.  Comments are
      updated accordingly.
      
      This makes the iterators easier to use and will simplify controllers.
      
      Note that @cgroup argument is renamed to @cgrp in
      cgroup_for_each_child() because it conflicts with "struct cgroup" used
      in the new macro body.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      75501a6d
    • cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() · 53fa5261
      Tejun Heo authored
      Currently, there's no easy way to find out the next sibling cgroup
      unless it's known that the current cgroup is accessed from the
      parent's children list in a single RCU critical section.  This in turn
      forces all iterators to require whole iteration to be enclosed in a
      single RCU critical section, which sometimes is too restrictive.  This
      patch implements cgroup_next_sibling() which can reliably determine
      the next sibling regardless of the state of the current cgroup as long
      as it's accessible.
      
      It currently is impossible to determine the next sibling after
      dropping the RCU read lock because the cgroup being iterated could be
      removed at any time and, once the RCU read lock is dropped, nothing
      guarantees its ->sibling.next pointer is accessible.  A removed cgroup would
      continue to point to its next sibling for RCU accesses but stop
      receiving updates from the sibling.  IOW, the next sibling could be
      removed and then complete its grace period while RCU read lock is
      dropped, making it unsafe to dereference ->sibling.next after dropping
      and re-acquiring RCU read lock.
      
      This can be solved by adding a way to traverse to the next sibling
      without dereferencing ->sibling.next.  This patch adds a monotonically
      increasing cgroup serial number, cgroup->serial_nr, which guarantees
      that all cgroup->children lists are kept in increasing serial_nr
      order.  A new function, cgroup_next_sibling(), is implemented, which,
      if CGRP_REMOVED is not set on the current cgroup, follows
      ->sibling.next; otherwise, traverses the parent's ->children list
      until it sees a sibling with higher ->serial_nr.
      
      This allows the function to always return the next sibling regardless
      of the state of the current cgroup without adding overhead in the fast
      path.
      
      Further patches will update the iterators to use cgroup_next_sibling()
      so that they allow dropping RCU read lock and blocking while iteration
      is in progress which in turn will be used to simplify controllers.
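      A sketch of the traversal rule described above (close to the intent,
      not the actual code; helper names are approximate):

      	struct cgroup *cgroup_next_sibling(struct cgroup *pos)
      	{
      		struct cgroup *next;

      		if (likely(!cgroup_is_removed(pos))) {
      			/* still on the parent's ->children list */
      			next = list_entry_rcu(pos->sibling.next,
      					      struct cgroup, sibling);
      			if (&next->sibling != &pos->parent->children)
      				return next;
      			return NULL;
      		}

      		/*
      		 * @pos was removed.  Siblings are kept in ascending
      		 * ->serial_nr order, so scan the parent's list for the
      		 * first sibling created after @pos.
      		 */
      		list_for_each_entry_rcu(next, &pos->parent->children, sibling)
      			if (next->serial_nr > pos->serial_nr)
      				return next;
      		return NULL;
      	}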
      
      v2: Typo fix as per Serge.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      53fa5261
    • cgroup: make cgroup_is_removed() static · bdc7119f
      Tejun Heo authored
      cgroup_is_removed() no longer has external users and it shouldn't grow
      any - controllers should deal with cgroup_subsys_state on/offline
      state instead of cgroup removal state.  Make it static.
      
      While at it, make it return bool.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bdc7119f
    • cgroup: fix a subtle bug in descendant pre-order walk · 7805d000
      Tejun Heo authored
      When cgroup_next_descendant_pre() initiates a walk, it checks whether
      the subtree root has any children and, if it doesn't, returns NULL.
      Later code assumes that the subtree isn't empty.  This is broken
      because the subtree may become empty in between, which can lead to the
      traversal escaping the subtree by walking to the sibling of the
      subtree root.
      
      There's no reason to have the early exit path.  Remove it along with
      the later assumption that the subtree isn't empty.  This simplifies
      the code a bit and fixes the subtle bug.
      
      While at it, fix the comment of cgroup_for_each_descendant_pre() which
      was incorrectly referring to ->css_offline() instead of
      ->css_online().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: stable@vger.kernel.org
      7805d000
  8. 23 May 2013, 1 commit
    • tracing: Fix crash when ftrace=nop on the kernel command line · ca164318
      Steven Rostedt (Red Hat) authored
      If ftrace=<tracer> is on the kernel command line, then when that
      tracer is registered, tracing_set_tracer() is called to start
      executing it.
      
      The nop tracer is just a stub tracer that is used to have no tracer
      enabled. It is assigned at early bootup as it is the default tracer.
      
      But if ftrace=nop is on the kernel command line, the registering of the
      nop tracer will call tracing_set_tracer() which will try to execute
      the nop tracer. But it expects tr->current_trace to be assigned something
      as it usually is assigned to the nop tracer. As it hasn't been assigned
      to anything yet, it causes the system to crash.
      
      The simple fix is to move the tr->current_trace = nop before registering
      the nop tracer. The functionality is still the same as the nop tracer
      doesn't do anything anyway.
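      The fix reduces to an ordering change of this shape (abridged from the
      description above; not the literal diff):

      	/* assign the default tracer _before_ registering it, so that a
      	 * command-line "ftrace=nop" hitting tracing_set_tracer() finds
      	 * tr->current_trace already valid */
      	global_trace.current_trace = &nop_trace;
      	register_tracer(&nop_trace);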
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      ca164318
  9. 18 May 2013, 1 commit
    • x86, range: fix missing merge during add range · fbe06b7b
      Yinghai Lu authored
      Christian found that v3.9 does not work on an E350 when EFI is enabled.
      
      [    1.658832] Trying to unpack rootfs image as initramfs...
      [    1.679935] BUG: unable to handle kernel paging request at ffff88006e3fd000
      [    1.686940] IP: [<ffffffff813661df>] memset+0x1f/0xb0
      [    1.692010] PGD 1f77067 PUD 1f7a067 PMD 61420067 PTE 0
      
      but early memtest reported that all memory could be accessed without problem.
      
      The early page table is set up in the following sequence:
      [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
      [    0.000000] init_memory_mapping: [mem 0x6e600000-0x6e7fffff]
      [    0.000000] init_memory_mapping: [mem 0x6c000000-0x6e5fffff]
      [    0.000000] init_memory_mapping: [mem 0x00100000-0x6bffffff]
      [    0.000000] init_memory_mapping: [mem 0x6e800000-0x6ea07fff]
      but later efi_enter_virtual_mode wrongly tries to set the mapping again:
      [    0.010644] pid_max: default: 32768 minimum: 301
      [    0.015302] init_memory_mapping: [mem 0x640c5000-0x6e3fcfff]
      which means it fails the pfn_range_is_mapped() check.
      
      It turns out that we have a bug in add_range_with_merge: it does not
      merge ranges properly when a newly added range fills the hole between
      two existing ranges.  In this case, [mem 0x00100000-0x6bffffff] is the
      hole between [mem 0x00000000-0x000fffff] and [mem 0x6c000000-0x6e7fffff].
      
      Fix add_range_with_merge() by having it call itself recursively.
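      A small self-contained model of the problem (simplified half-open
      ranges, not the kernel's struct range code) shows why the recursive
      call is needed: merging the new range with one neighbour must be
      allowed to merge again with the other neighbour.

      	#include <stdio.h>

      	struct range { unsigned long start, end; };

      	static void add_range_with_merge(struct range *r, int *nr,
      					 unsigned long start, unsigned long end)
      	{
      		int i;

      		for (i = 0; i < *nr; i++) {
      			if (start > r[i].end || end < r[i].start)
      				continue;		/* no overlap/adjacency */
      			/* extend over the existing range ... */
      			start = start < r[i].start ? start : r[i].start;
      			end = end > r[i].end ? end : r[i].end;
      			/* ... remove it and re-add, so the merged range can
      			 * merge once more with its other neighbour (the fix) */
      			r[i] = r[--(*nr)];
      			add_range_with_merge(r, nr, start, end);
      			return;
      		}
      		r[(*nr)++] = (struct range){ start, end };
      	}

      	int main(void)
      	{
      		struct range r[8];
      		int nr = 0, i;

      		add_range_with_merge(r, &nr, 0, 1);
      		add_range_with_merge(r, &nr, 3, 4);
      		add_range_with_merge(r, &nr, 1, 3);	/* fills the hole */

      		for (i = 0; i < nr; i++)
      			printf("[%lu, %lu)\n", r[i].start, r[i].end);
      		return 0;	/* a single merged range: [0, 4) */
      	}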
      Reported-by: "Christian König" <christian.koenig@amd.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/CAE9FiQVofGoSk7q5-0irjkBxemqK729cND4hov-1QCBJDhxpgQ@mail.gmail.com
      Cc: <stable@vger.kernel.org> v3.9
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      fbe06b7b
  10. 17 May 2013, 3 commits
  11. 16 May 2013, 7 commits
    • tracing: Return -EBUSY when event_enable_func() fails to get module · 6ed01066
      Masami Hiramatsu authored
      Since try_module_get() returns false (= 0) when it fails to pin down a
      module, event_enable_func() returns 0, which means "success".  This can
      cause a kernel panic when the entry is removed, because the event has
      already been released.
      
      This fixes the bug by returning -EBUSY, because the reason
      why it fails is that the module is being removed at that time.
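      A sketch of the error path this describes (the struct layout is an
      assumption; not the actual diff):

      	/* try_module_get() returns bool, so a failure must be turned
      	 * into a negative errno instead of being returned as 0 */
      	if (!try_module_get(file->event_call->mod))
      		return -EBUSY;	/* the module is being removed right now */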
      
      Link: http://lkml.kernel.org/r/20130516114848.13508.97899.stgit@mhiramat-M0-7522
      
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tom Zanussi <tom.zanussi@intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      6ed01066
    • workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init() · 1be0c25d
      Tejun Heo authored
      wq_numa_init() builds per-node cpumasks which are later used to make
      unbound workqueues NUMA-aware.  The cpumasks are allocated using
      alloc_cpumask_var_node() for all possible nodes.  Unfortunately, on
      machines with off-line nodes, this leads to NUMA-aware allocations on
      existing but offline nodes, which in turn triggers a BUG in the memory
      allocation code.
      
      Fix it by using NUMA_NO_NODE for cpumask allocations for offline
      nodes.
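      A sketch of what the fix amounts to (the tbl[] naming is assumed; not
      the literal diff):

      	for_each_node(node)
      		BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
      					       node_online(node) ? node
      							     : NUMA_NO_NODE));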
      
        kernel BUG at include/linux/gfp.h:323!
        invalid opcode: 0000 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0+ #1
        Hardware name: ProLiant BL465c G7, BIOS A19 12/10/2011
        task: ffff880234608000 ti: ffff880234602000 task.ti: ffff880234602000
        RIP: 0010:[<ffffffff8117495d>]  [<ffffffff8117495d>] new_slab+0x2ad/0x340
        RSP: 0000:ffff880234603bf8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff880237404b40 RCX: 00000000000000d0
        RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000002052d0
        RBP: ffff880234603c28 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000001 R11: ffffffff812e3aa8 R12: 0000000000000001
        R13: ffff8802378161c0 R14: 0000000000030027 R15: 00000000000040d0
        FS:  0000000000000000(0000) GS:ffff880237800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: ffff88043fdff000 CR3: 00000000018d5000 CR4: 00000000000007f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Stack:
         ffff880234603c28 0000000000000001 00000000000000d0 ffff8802378161c0
         ffff880237404b40 ffff880237404b40 ffff880234603d28 ffffffff815edba1
         ffff880237816140 0000000000000000 ffff88023740e1c0
        Call Trace:
         [<ffffffff815edba1>] __slab_alloc+0x330/0x4f2
         [<ffffffff81174b25>] kmem_cache_alloc_node_trace+0xa5/0x200
         [<ffffffff812e3aa8>] alloc_cpumask_var_node+0x28/0x90
         [<ffffffff81a0bdb3>] wq_numa_init+0x10d/0x1be
         [<ffffffff81a0bec8>] init_workqueues+0x64/0x341
         [<ffffffff810002ea>] do_one_initcall+0xea/0x1a0
         [<ffffffff819f1f31>] kernel_init_freeable+0xb7/0x1ec
         [<ffffffff815d50de>] kernel_init+0xe/0xf0
         [<ffffffff815ff89c>] ret_from_fork+0x7c/0xb0
        Code: 45  84 ac 00 00 00 f0 41 80 4d 00 40 e9 f6 fe ff ff 66 0f 1f 84 00 00 00 00 00 e8 eb 4b ff ff 49 89 c5 e9 05 fe ff ff <0f> 0b 4c 8b 73 38 44 89 ff 81 cf 00 00 20 00 4c 89 f6 48 c1 ee
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-Tested-by: Lingzhu Xiang <lxiang@redhat.com>
      1be0c25d
    • tracing/kprobes: Make print_*probe_event static · b62fdd97
      Masami Hiramatsu authored
      According to a sparse warning, make print_*probe_event static because
      those functions are not directly called from outside.
      
      Link: http://lkml.kernel.org/r/20130513115839.6545.83067.stgit@mhiramat-M0-7522
      
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tom Zanussi <tom.zanussi@intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      b62fdd97
    • tracing/kprobes: Fix a sparse warning for incorrect type in assignment · 3d1fc7b0
      Masami Hiramatsu authored
      Fix a sparse warning about an RCU-managed pointer being defined
      without the __rcu address space annotation.
      
      Link: http://lkml.kernel.org/r/20130513115837.6545.23322.stgit@mhiramat-M0-7522
      
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tom Zanussi <tom.zanussi@intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      3d1fc7b0
    • tracing/kprobes: Use rcu_dereference_raw for tp->files · c02c7e65
      Masami Hiramatsu authored
      Use rcu_dereference_raw() for accessing tp->files.  Because the
      write side uses rcu_assign_pointer() for its memory barrier, the
      read side also has to use rcu_dereference_raw() with a read memory
      barrier.
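      The read side described here amounts to something like the following
      (shape assumed, with tp->files taken as a NULL-terminated array; not
      the literal diff):

      	struct ftrace_event_file **file = rcu_dereference_raw(tp->files);

      	while (file && *file) {
      		/* ... deliver the event to (*file) ... */
      		file++;
      	}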
      
      Link: http://lkml.kernel.org/r/20130513115834.6545.17022.stgit@mhiramat-M0-7522
      
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tom Zanussi <tom.zanussi@intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      c02c7e65
    • tracing: Fix leaks of filter preds · 60705c89
      Steven Rostedt (Red Hat) authored
      Special preds are created when folding a series of preds that
      can be done in serial. These are allocated in an ops field of
      the pred structure. But they were never freed, causing memory
      leaks.
      
      This was discovered using the kmemleak checker:
      
      unreferenced object 0xffff8800797fd5e0 (size 32):
        comm "swapper/0", pid 1, jiffies 4294690605 (age 104.608s)
        hex dump (first 32 bytes):
          00 00 01 00 03 00 05 00 07 00 09 00 0b 00 0d 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff814b52af>] kmemleak_alloc+0x73/0x98
          [<ffffffff8111ff84>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
          [<ffffffff81120e68>] __kmalloc+0xd7/0x125
          [<ffffffff810d47eb>] kcalloc.constprop.24+0x2d/0x2f
          [<ffffffff810d4896>] fold_pred_tree_cb+0xa9/0xf4
          [<ffffffff810d3781>] walk_pred_tree+0x47/0xcc
          [<ffffffff810d5030>] replace_preds.isra.20+0x6f8/0x72f
          [<ffffffff810d50b5>] create_filter+0x4e/0x8b
          [<ffffffff81b1c30d>] ftrace_test_event_filter+0x5a/0x155
          [<ffffffff8100028d>] do_one_initcall+0xa0/0x137
          [<ffffffff81afbedf>] kernel_init_freeable+0x14d/0x1dc
          [<ffffffff814b24b7>] kernel_init+0xe/0xdb
          [<ffffffff814d539c>] ret_from_fork+0x7c/0xb0
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: stable@vger.kernel.org # 2.6.39+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      60705c89
    • rcu: Don't allocate bootmem from rcu_init() · 615ee544
      Sasha Levin authored
      When rcu_init() is called we already have slab working; allocating
      bootmem at that point results in warnings and an allocation from
      slab.  This commit therefore changes alloc_bootmem_cpumask_var() to
      alloc_cpumask_var() in rcu_bootup_announce_oddness(), which is called
      from rcu_init().
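      In other words (the cpumask variable name is assumed; not the literal
      diff):

      	/* was: alloc_bootmem_cpumask_var(&rcu_nocb_mask); */
      	zalloc_cpumask_var(&rcu_nocb_mask, GFP_KERNEL);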
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Tested-by: Robin Holt <holt@sgi.com>
      
      [paulmck: convert to zalloc_cpumask_var(), as suggested by Yinghai Lu.]
      615ee544
  12. 15 May 2013, 7 commits