1. 29 7月, 2014 5 次提交
  2. 23 7月, 2014 1 次提交
  3. 21 7月, 2014 1 次提交
  4. 16 7月, 2014 10 次提交
    • D
      locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER · 5db6c6fe
      Davidlohr Bueso 提交于
      Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER),
      encapsulate the dependencies for rwsem optimistic spinning.
      No logical changes here as it continues to depend on both
      SMP and the XADD algorithm variant.
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Acked-by: NJason Low <jason.low2@hp.com>
      [ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ]
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com
      Cc: aswin@hp.com
      Cc: Chris Mason <clm@fb.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      5db6c6fe
    • P
      locking/mutex: Disable optimistic spinning on some architectures · 4badad35
      Peter Zijlstra 提交于
      The optimistic spin code assumes regular stores and cmpxchg() play nice;
      this is found to not be true for at least: parisc, sparc32, tile32,
      metag-lock1, arc-!llsc and hexagon.
      
      There is further wreckage, but this in particular seemed easy to
      trigger, so blacklist this.
      
      Opt in for known good archs.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Reported-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: John David Anglin <dave.anglin@bell.net>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: stable@vger.kernel.org
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: sparclinux@vger.kernel.org
      Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      4badad35
    • P
      locking/rwsem: Rename 'activity' to 'count' · 13b9a962
      Peter Zijlstra 提交于
      There are two definitions of struct rw_semaphore, one in linux/rwsem.h
      and one in linux/rwsem-spinlock.h.
      
      For some reason they have different names for the initial field. This
      makes it impossible to use C99 named initialization for
      __RWSEM_INITIALIZER() -- or we have to duplicate that entire thing
      along with the structure definitions.
      
      The simpler patch is renaming the rwsem-spinlock variant to match the
      regular rwsem.
      
      This allows us to switch to C99 named initialization.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-bmrZolsbGmautmzrerog27io@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      13b9a962
    • M
      sched: Fix possible divide by zero in avg_atom() calculation · b0ab99e7
      Mateusz Guzik 提交于
      proc_sched_show_task() does:
      
        if (nr_switches)
      	do_div(avg_atom, nr_switches);
      
      nr_switches is unsigned long and do_div truncates it to 32 bits, which
      means it can test non-zero on e.g. x86-64 and be truncated to zero for
      division.
      
      Fix the problem by using div64_ul() instead.
      
      As a side effect calculations of avg_atom for big nr_switches are now correct.
      Signed-off-by: NMateusz Guzik <mguzik@redhat.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b0ab99e7
    • J
      locking/spinlocks/mcs: Micro-optimize osq_unlock() · 33ecd208
      Jason Low 提交于
      In the unlock function of the cancellable MCS spinlock, the first
      thing we do is to retrive the current CPU's osq node. However, due to
      the changes made in the previous patch, in the common case where the
      lock is not contended, we wouldn't need to access the current CPU's
      osq node anymore.
      
      This patch optimizes this by only retriving this CPU's osq node
      after we attempt the initial cmpxchg to unlock the osq and found
      that its contended.
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1405358872-3732-5-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      33ecd208
    • J
      locking/spinlocks/mcs: Introduce and use init macro and function for osq locks · 4d9d951e
      Jason Low 提交于
      Currently, we initialize the osq lock by directly setting the lock's values. It
      would be preferable if we use an init macro to do the initialization like we do
      with other locks.
      
      This patch introduces and uses a macro and function for initializing the osq lock.
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      4d9d951e
    • J
      locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead · 90631822
      Jason Low 提交于
      The cancellable MCS spinlock is currently used to queue threads that are
      doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining
      the lock would access and queue the local node corresponding to the CPU that
      it's running on. Currently, the cancellable MCS lock is implemented by using
      pointers to these nodes.
      
      In this patch, instead of operating on pointers to the per-cpu nodes, we
      store the CPU numbers in which the per-cpu nodes correspond to in atomic_t.
      A similar concept is used with the qspinlock.
      
      By operating on the CPU # of the nodes using atomic_t instead of pointers
      to those nodes, this can reduce the overhead of the cancellable MCS spinlock
      by 32 bits (on 64 bit systems).
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      90631822
    • J
      locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node() · 046a619d
      Jason Low 提交于
      Currently, the per-cpu nodes structure for the cancellable MCS spinlock is
      named "optimistic_spin_queue". However, in a follow up patch in the series
      we will be introducing a new structure that serves as the new "handle" for
      the lock. It would make more sense if that structure is named
      "optimistic_spin_queue". Additionally, since the current use of the
      "optimistic_spin_queue" structure are  "nodes", it might be better if we
      rename them to "node" anyway.
      
      This preparatory patch renames all current "optimistic_spin_queue"
      to "optimistic_spin_node".
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Link: http://lkml.kernel.org/r/1405358872-3732-2-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      046a619d
    • J
      locking/rwsem: Allow conservative optimistic spinning when readers have lock · 37e95624
      Jason Low 提交于
      Commit 4fc828e2 ("locking/rwsem: Support optimistic spinning")
      introduced a major performance regression for workloads such as
      xfs_repair which mix read and write locking of the mmap_sem across
      many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
      than on 3.15 and using 20x more system CPU time.
      
      Perf profiles indicate in some workloads that significant time can
      be spent spinning on !owner. This is because we don't set the lock
      owner when readers(s) obtain the rwsem.
      
      In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
      return false if there is no lock owner. The rationale is that if we
      just entered the slowpath, yet there is no lock owner, then there is
      a possibility that a reader has the lock. To be conservative, we'll
      avoid spinning in these situations.
      
      This patch reduced the total run time of the xfs_repair workload from
      about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
      back to close to the same performance as on 3.15.
      
      Retesting of AIM7, which were some of the workloads used to test the
      original optimistic spinning code, confirmed that we still get big
      performance gains with optimistic spinning, even with this additional
      regression fix. Davidlohr found that while the 'custom' workload took
      a performance hit of ~-14% to throughput for >300 users with this
      additional patch, the overall gain with optimistic spinning is
      still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users.
      Tested-by: NDave Chinner <dchinner@redhat.com>
      Acked-by: NDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBoxSigned-off-by: NIngo Molnar <mingo@kernel.org>
      37e95624
    • M
      ring-buffer: Fix polling on trace_pipe · 97b8ee84
      Martin Lau 提交于
      ring_buffer_poll_wait() should always put the poll_table to its wait_queue
      even there is immediate data available.  Otherwise, the following epoll and
      read sequence will eventually hang forever:
      
      1. Put some data to make the trace_pipe ring_buffer read ready first
      2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
      3. epoll_wait()
      4. read(trace_pipe_fd) till EAGAIN
      5. Add some more data to the trace_pipe ring_buffer
      6. epoll_wait() -> this epoll_wait() will block forever
      
      ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
        ring_buffer_poll_wait() returns immediately without adding poll_table,
        which has poll_table->_qproc pointing to ep_poll_callback(), to its
        wait_queue.
      ~ During the epoll_wait() call in step 3 and step 6,
        ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
        because the poll_table->_qproc is NULL and it is how epoll works.
      ~ When there is new data available in step 6, ring_buffer does not know
        it has to call ep_poll_callback() because it is not in its wait queue.
        Hence, block forever.
      
      Other poll implementation seems to call poll_wait() unconditionally as the very
      first thing to do.  For example, tcp_poll() in tcp.c.
      
      Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
      
      Cc: stable@vger.kernel.org # 2.6.27
      Fixes: 2a2cc8f7 "ftrace: allow the event pipe to be polled"
      Reviewed-by: NChris Mason <clm@fb.com>
      Signed-off-by: NMartin Lau <kafai@fb.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      97b8ee84
  5. 15 7月, 2014 6 次提交
  6. 08 7月, 2014 1 次提交
  7. 07 7月, 2014 1 次提交
    • Y
      workqueue: zero cpumask of wq_numa_possible_cpumask on init · 5a6024f1
      Yasuaki Ishimatsu 提交于
      When hot-adding and onlining CPU, kernel panic occurs, showing following
      call trace.
      
        BUG: unable to handle kernel paging request at 0000000000001d08
        IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
        PGD 0
        Oops: 0000 [#1] SMP
        ...
        Call Trace:
         [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
         [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
         [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
         [<ffffffff811926f1>] new_slab+0x91/0x300
         [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
         [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
         [<ffffffff810a3c78>] ? load_balance+0x218/0x890
         [<ffffffff8101a679>] ? sched_clock+0x9/0x10
         [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
         [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
         [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
         [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
         [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
         [<ffffffff8105d0ec>] do_fork+0xbc/0x360
         [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
         [<ffffffff81086652>] kthreadd+0x2c2/0x300
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
         [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
      
      In my investigation, I found the root cause is wq_numa_possible_cpumask.
      All entries of wq_numa_possible_cpumask is allocated by
      alloc_cpumask_var_node(). And these entries are used without initializing.
      So these entries have wrong value.
      
      When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
      wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
      calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
      as follow:
      
      3592         /* if cpumask is contained inside a NUMA node, we belong to that node */
      3593         if (wq_numa_enabled) {
      3594                 for_each_node(node) {
      3595                         if (cpumask_subset(pool->attrs->cpumask,
      3596                                            wq_numa_possible_cpumask[node])) {
      3597                                 pool->node = node;
      3598                                 break;
      3599                         }
      3600                 }
      3601         }
      
      But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
      node is selected. As a result, kernel panic occurs.
      
      By this patch, all entries of wq_numa_possible_cpumask are allocated by
      zalloc_cpumask_var_node to initialize them. And the panic disappeared.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: bce90380 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
      5a6024f1
  8. 06 7月, 2014 1 次提交
  9. 04 7月, 2014 1 次提交
  10. 02 7月, 2014 3 次提交
    • J
      perf: Do not allow optimized switch for non-cloned events · 1f9a7268
      Jiri Olsa 提交于
      The context check in perf_event_context_sched_out allows
      non-cloned context to be part of the optimized schedule
      out switch.
      
      This could move non-cloned context into another workload
      child. Once this child exits, the context is closed and
      leaves all original (parent) events in closed state.
      
      Any other new cloned event will have closed state and not
      measure anything. And probably causing other odd bugs.
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1403598026-2310-2-git-send-email-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1f9a7268
    • T
      cpuset: break kernfs active protection in cpuset_write_resmask() · 76bb5ab8
      Tejun Heo 提交于
      Writing to either "cpuset.cpus" or "cpuset.mems" file flushes
      cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up
      migrating tasks off a cpuset after new resources are added to it.
      
      As cpuset_hotplug_work calls into cgroup core via
      cgroup_transfer_tasks(), this flushing adds the dependency to cgroup
      core locking from cpuset_write_resmak().  This used to be okay because
      cgroup interface files were protected by a different mutex; however,
      8353da1f ("cgroup: remove cgroup_tree_mutex") simplified the
      cgroup core locking and this dependency became a deadlock hazard -
      cgroup file removal performed under cgroup core lock tries to drain
      on-going file operation which is trying to flush cpuset_hotplug_work
      blocked on the same cgroup core lock.
      
      The locking simplification was done because kernfs added an a lot
      easier way to deal with circular dependencies involving kernfs active
      protection.  Let's use the same strategy in cpuset and break active
      protection in cpuset_write_resmask().  While it isn't the prettiest,
      this is a very rare, likely unique, situation which also goes away on
      the unified hierarchy.
      
      The commands to trigger the deadlock warning without the patch and the
      lockdep output follow.
      
       localhost:/ # mount -t cgroup -o cpuset xxx /cpuset
       localhost:/ # mkdir /cpuset/tmp
       localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus
       localhost:/ # echo 0 > cpuset/tmp/cpuset.mems
       localhost:/ # echo $$ > /cpuset/tmp/tasks
       localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.16.0-rc1-0.1-default+ #7 Not tainted
        -------------------------------------------------------
        kworker/1:0/32649 is trying to acquire lock:
         (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
      
        but task is already holding lock:
         (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (cpuset_hotplug_work){+.+...}:
        ...
        -> #1 (s_active#175){++++.+}:
        ...
        -> #0 (cgroup_mutex){+.+.+.}:
        ...
      
        other info that might help us debug this:
      
        Chain exists of:
          cgroup_mutex --> s_active#175 --> cpuset_hotplug_work
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(cpuset_hotplug_work);
      				 lock(s_active#175);
      				 lock(cpuset_hotplug_work);
          lock(cgroup_mutex);
      
         *** DEADLOCK ***
      
        2 locks held by kworker/1:0/32649:
         #0:  ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
         #1:  (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
      
        stack backtrace:
        CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7
       ...
        Call Trace:
         [<ffffffff815a5f78>] dump_stack+0x72/0x8a
         [<ffffffff810c263f>] print_circular_bug+0x10f/0x120
         [<ffffffff810c481e>] check_prev_add+0x43e/0x4b0
         [<ffffffff810c4ee6>] validate_chain+0x656/0x7c0
         [<ffffffff810c53d2>] __lock_acquire+0x382/0x660
         [<ffffffff810c57a9>] lock_acquire+0xf9/0x170
         [<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380
         [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
         [<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0
         [<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180
         [<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630
         [<ffffffff810854d4>] process_one_work+0x254/0x520
         [<ffffffff810875dd>] worker_thread+0x13d/0x3d0
         [<ffffffff8108e0c8>] kthread+0xf8/0x100
         [<ffffffff815acaec>] ret_from_fork+0x7c/0xb0
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NLi Zefan <lizefan@huawei.com>
      Tested-by: NLi Zefan <lizefan@huawei.com>
      76bb5ab8
    • S
      tracing: Remove ftrace_stop/start() from reading the trace file · 099ed151
      Steven Rostedt (Red Hat) 提交于
      Disabling reading and writing to the trace file should not be able to
      disable all function tracing callbacks. There's other users today
      (like kprobes and perf). Reading a trace file should not stop those
      from happening.
      
      Cc: stable@vger.kernel.org # 3.0+
      Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      099ed151
  11. 01 7月, 2014 4 次提交
  12. 30 6月, 2014 2 次提交
    • L
      cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() · 3a32bd72
      Li Zefan 提交于
      We've converted cgroup to kernfs so cgroup won't be intertwined with
      vfs objects and locking, but there are dark areas.
      
      Run two instances of this script concurrently:
      
          for ((; ;))
          {
          	mount -t cgroup -o cpuacct xxx /cgroup
          	umount /cgroup
          }
      
      After a while, I saw two mount processes were stuck at retrying, because
      they were waiting for a subsystem to become free, but the root associated
      with this subsystem never got freed.
      
      This can happen, if thread A is in the process of killing superblock but
      hasn't called percpu_ref_kill(), and at this time thread B is mounting
      the same cgroup root and finds the root in the root list and performs
      percpu_ref_try_get().
      
      To fix this, we try to increase both the refcnt of the superblock and the
      percpu refcnt of cgroup root.
      
      v2:
      - we should try to get both the superblock refcnt and cgroup_root refcnt,
        because cgroup_root may have no superblock assosiated with it.
      - adjust/add comments.
      
      tj: Updated comments.  Renamed @sb to @pinned_sb.
      
      Cc: <stable@vger.kernel.org> # 3.15
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      3a32bd72
    • L
      cgroup: fix mount failure in a corner case · 970317aa
      Li Zefan 提交于
        # cat test.sh
        #! /bin/bash
      
        mount -t cgroup -o cpu xxx /cgroup
        umount /cgroup
      
        mount -t cgroup -o cpu,cpuacct xxx /cgroup
        umount /cgroup
        # ./test.sh
        mount: xxx already mounted or /cgroup busy
        mount: according to mtab, xxx is already mounted on /cgroup
      
      It's because the cgroupfs_root of the first mount was under destruction
      asynchronously.
      
      Fix this by delaying and then retrying mount for this case.
      
      v3:
      - put the refcnt immediately after getting it. (Tejun)
      
      v2:
      - use percpu_ref_tryget_live() rather that introducing
        percpu_ref_alive(). (Tejun)
      - adjust comment.
      
      tj: Updated the comment a bit.
      
      Cc: <stable@vger.kernel.org> # 3.15
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      970317aa
  13. 25 6月, 2014 1 次提交
    • G
      cpuset,mempolicy: fix sleeping function called from invalid context · 391acf97
      Gu Zheng 提交于
      When runing with the kernel(3.15-rc7+), the follow bug occurs:
      [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
      [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
      [ 9969.441175] INFO: lockdep is turned off.
      [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G       A      3.15.0-rc7+ #85
      [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
      [ 9969.706052]  ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
      [ 9969.795323]  ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
      [ 9969.884710]  ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
      [ 9969.974071] Call Trace:
      [ 9970.003403]  [<ffffffff8162f523>] dump_stack+0x4d/0x66
      [ 9970.065074]  [<ffffffff8109995a>] __might_sleep+0xfa/0x130
      [ 9970.130743]  [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
      [ 9970.200638]  [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
      [ 9970.272610]  [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
      [ 9970.344584]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.409282]  [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
      [ 9970.471897]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.536585]  [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
      [ 9970.613763]  [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
      [ 9970.683660]  [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
      [ 9970.759795]  [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
      [ 9970.834885]  [<ffffffff8106a598>] do_fork+0xd8/0x380
      [ 9970.894375]  [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
      [ 9970.969470]  [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
      [ 9971.030011]  [<ffffffff81642009>] stub_clone+0x69/0x90
      [ 9971.091573]  [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
      
      The cause is that cpuset_mems_allowed() try to take
      mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in
      __mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
      under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
      protection region to protect the access to cpuset only in
      current_cpuset_is_being_rebound(). So that we can avoid this bug.
      
      This patch is a temporary solution that just addresses the bug
      mentioned above, can not fix the long-standing issue about cpuset.mems
      rebinding on fork():
      
      "When the forker's task_struct is duplicated (which includes
       ->mems_allowed) and it races with an update to cpuset_being_rebound
       in update_tasks_nodemask() then the task's mems_allowed doesn't get
       updated. And the child task's mems_allowed can be wrong if the
       cpuset's nodemask changes before the child has been added to the
       cgroup's tasklist."
      Signed-off-by: NGu Zheng <guz.fnst@cn.fujitsu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable <stable@vger.kernel.org>
      391acf97
  14. 24 6月, 2014 3 次提交
    • A
      kernel/watchdog.c: print traces for all cpus on lockup detection · ed235875
      Aaron Tomlin 提交于
      A 'softlockup' is defined as a bug that causes the kernel to loop in
      kernel mode for more than a predefined period to time, without giving
      other tasks a chance to run.
      
      Currently, upon detection of this condition by the per-cpu watchdog
      task, debug information (including a stack trace) is sent to the system
      log.
      
      On some occasions, we have observed that the "victim" rather than the
      actual "culprit" (i.e.  the owner/holder of the contended resource) is
      reported to the user.  Often this information has proven to be
      insufficient to assist debugging efforts.
      
      To avoid loss of useful debug information, for architectures which
      support NMI, this patch makes it possible to improve soft lockup
      reporting.  This is accomplished by issuing an NMI to each cpu to obtain
      a stack trace.
      
      If NMI is not supported we just revert back to the old method.  A sysctl
      and boot-time parameter is available to toggle this feature.
      
      [dzickus@redhat.com: add CONFIG_SMP in certain areas]
      [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
      [mq@suse.cz: fix warning]
      Signed-off-by: NAaron Tomlin <atomlin@redhat.com>
      Signed-off-by: NDon Zickus <dzickus@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NJan Moskyto Matejka <mq@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed235875
    • D
      mm, pcp: allow restoring percpu_pagelist_fraction default · 7cd2b0a3
      David Rientjes 提交于
      Oleg reports a division by zero error on zero-length write() to the
      percpu_pagelist_fraction sysctl:
      
          divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
          Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
          task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
          RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
          RSP: 0018:ffff8800d87a3e78  EFLAGS: 00010246
          RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
          RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
          R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
          R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
          FS:  00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
          Call Trace:
            proc_sys_call_handler+0xb3/0xc0
            proc_sys_write+0x14/0x20
            vfs_write+0xba/0x1e0
            SyS_write+0x46/0xb0
            tracesys+0xe1/0xe6
      
      However, if the percpu_pagelist_fraction sysctl is set by the user, it
      is also impossible to restore it to the kernel default since the user
      cannot write 0 to the sysctl.
      
      This patch allows the user to write 0 to restore the default behavior.
      It still requires a fraction equal to or larger than 8, however, as
      stated by the documentation for sanity.  If a value in the range [1, 7]
      is written, the sysctl will return EINVAL.
      
      This successfully solves the divide by zero issue at the same time.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NOleg Drokin <green@linuxhacker.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cd2b0a3
    • D
      kernel/watchdog.c: remove preemption restrictions when restarting lockup detector · bde92cf4
      Don Zickus 提交于
      Peter Wu noticed the following splat on his machine when updating
      /proc/sys/kernel/watchdog_thresh:
      
        BUG: sleeping function called from invalid context at mm/slub.c:965
        in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
        3 locks held by init/1:
         #0:  (sb_writers#3){.+.+.+}, at: [<ffffffff8117b663>] vfs_write+0x143/0x180
         #1:  (watchdog_proc_mutex){+.+.+.}, at: [<ffffffff810e02d3>] proc_dowatchdog+0x33/0x110
         #2:  (cpu_hotplug.lock){.+.+.+}, at: [<ffffffff810589c2>] get_online_cpus+0x32/0x80
        Preemption disabled at:[<ffffffff810e0384>] proc_dowatchdog+0xe4/0x110
      
        CPU: 0 PID: 1 Comm: init Not tainted 3.16.0-rc1-testing #34
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x4e/0x7a
          __might_sleep+0x11d/0x190
          kmem_cache_alloc_trace+0x4e/0x1e0
          perf_event_alloc+0x55/0x440
          perf_event_create_kernel_counter+0x26/0xe0
          watchdog_nmi_enable+0x75/0x140
          update_timers_all_cpus+0x53/0xa0
          proc_dowatchdog+0xe4/0x110
          proc_sys_call_handler+0xb3/0xc0
          proc_sys_write+0x14/0x20
          vfs_write+0xad/0x180
          SyS_write+0x49/0xb0
          system_call_fastpath+0x16/0x1b
        NMI watchdog: disabled (cpu0): hardware events not enabled
      
      What happened is after updating the watchdog_thresh, the lockup detector
      is restarted to utilize the new value.  Part of this process involved
      disabling preemption.  Once preemption was disabled, perf tried to
      allocate a new event (as part of the restart).  This caused the above
      BUG_ON as you can't sleep with preemption disabled.
      
      The preemption restriction seemed agressive as we are not doing anything
      on that particular cpu, but with all the online cpus (which are
      protected by the get_online_cpus lock).  Remove the restriction and the
      BUG_ON goes away.
      Signed-off-by: NDon Zickus <dzickus@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NPeter Wu <peter@lekensteyn.nl>
      Tested-by: NPeter Wu <peter@lekensteyn.nl>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>		[3.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bde92cf4