1. 26 Mar 2014, 1 commit
    • tracing: Fix traceon trigger condition to actually turn tracing on · 2c4a33ab
      Committed by Steven Rostedt (Red Hat)
      While working on my tutorial for the 2014 Linux Collaboration Summit
      I found that the traceon trigger did not work when conditions were
      used; the other triggers worked fine. Looking into it, the cause is
      the way triggers use the ring buffer to store the fields they need
      to evaluate the condition. If tracing is off, nothing is stored in
      the buffer, and the tracepoint exits before calling the trigger to
      test the condition. This is fine for the triggers that only act
      while tracing is on, but the traceon trigger is meant to work while
      tracing is off, so it never fires.
      
      The fix is simple: use a temporary ring buffer to record the event
      if tracing is off and the event has a conditional trace event
      trigger enabled. The rest of the tracepoint code works as before,
      but the event won't be recorded in the other buffers.
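      A sketch of the idea (the helper and buffer names here are
      illustrative assumptions, not the actual diff):

        /*
         * Tracing is off, but a conditional trigger still needs this
         * event's fields: record into a scratch buffer that is never
         * committed to the real trace buffers.
         */
        if (!tracing_is_on() && event_has_cond_trigger(file))
            buffer = temp_buffer;   /* hypothetical per-CPU scratch buffer */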
      
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  2. 21 Mar 2014, 2 commits
    • futex: revert back to the explicit waiter counting code · 11d4616b
      Committed by Linus Torvalds
      Srikar Dronamraju reports that commit b0c29f79 ("futexes: Avoid
      taking the hb->lock if there's nothing to wake up") causes java
      threads to get stuck on futexes when running specjbb on a power7
      numa box.
      
      The cause appears to be that the powerpc spinlocks aren't using the
      same ticket lock model that we use on x86 (and other) architectures,
      which in turn results in the "spin_is_locked()" test in
      hb_waiters_pending() occasionally reporting an unlocked spinlock
      even when there are pending waiters.
      
      So this reinstates Davidlohr Bueso's original explicit waiter counting
      code, which I had convinced Davidlohr to drop in favor of figuring out
      the pending waiters by just using the existing state of the spinlock and
      the wait queue.
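      A minimal sketch of the reinstated counting (names per
      kernel/futex.c of this era; a sketch, not the verbatim patch):

        #ifdef CONFIG_SMP
        static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
        {
            atomic_inc(&hb->waiters);
            /* full barrier: pairs with the waker reading the count */
            smp_mb__after_atomic_inc();
        }

        static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
        {
            atomic_dec(&hb->waiters);
        }

        static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
        {
            /* an explicit count instead of inferring waiters from
             * spin_is_locked(), which non-ticket locks can't support */
            return atomic_read(&hb->waiters);
        }
        #endif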
      Reported-and-tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Original-code-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracing: Fix array size mismatch in format string · 87291347
      Committed by Vaibhav Nagarnaik
      In event format strings, the array size is reported in two
      locations: once in the array subscript and again via the "size:"
      attribute. The two values do not match.

      For example, in sched:sched_switch the prev_comm and next_comm
      character arrays have a subscript of [32], whereas the actual field
      size is 16.
      
      name: sched_switch
      ID: 301
      format:
              field:unsigned short common_type;       offset:0;       size:2; signed:0;
              field:unsigned char common_flags;       offset:2;       size:1; signed:0;
              field:unsigned char common_preempt_count;       offset:3;       size:1;signed:0;
              field:int common_pid;   offset:4;       size:4; signed:1;
      
              field:char prev_comm[32];       offset:8;       size:16;        signed:1;
              field:pid_t prev_pid;   offset:24;      size:4; signed:1;
              field:int prev_prio;    offset:28;      size:4; signed:1;
              field:long prev_state;  offset:32;      size:8; signed:1;
              field:char next_comm[32];       offset:40;      size:16;        signed:1;
              field:pid_t next_pid;   offset:56;      size:4; signed:1;
              field:int next_prio;    offset:60;      size:4; signed:1;
      
      After bisection, the following commit was blamed:
      92edca07 tracing: Use direct field, type and system names
      
      This commit removes the duplication of strings for field->name and
      field->type, assuming that all the strings passed to
      __trace_define_field() are immutable. This is not true for arrays,
      where the type string is built in the event_storage variable and
      field->type for all array fields points to event_storage.
      
      Use __stringify() to create a string constant for the type string.
      
      Also, get rid of event_storage and event_storage_mutex that are not
      needed anymore.
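      The same trick in standalone form (TASK_COMM_LEN is the kernel's
      16-byte comm size; the macro body matches the kernel's
      __stringify()):

        #include <stdio.h>

        #define __stringify_1(x...)  #x
        #define __stringify(x...)    __stringify_1(x)  /* expand, then quote */

        #define TASK_COMM_LEN 16

        int main(void)
        {
            /* a genuine string constant, so field->type can point at it
             * forever without any event_storage buffer */
            const char *type = "char[" __stringify(TASK_COMM_LEN) "]";
            printf("%s\n", type);   /* prints: char[16] */
            return 0;
        }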
      
      An added benefit is that this reduces the overhead of events a bit more:
      
         text    data     bss     dec     hex filename
      8424787 2036472 1302528 11763787         b3804b vmlinux
      8420814 2036408 1302528 11759750         b37086 vmlinux.patched
      
      Link: http://lkml.kernel.org/r/1392349908-29685-1-git-send-email-vnagarnaik@google.com
      
      Cc: Laurent Chavey <chavey@google.com>
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  3. 19 Mar 2014, 1 commit
  4. 11 Mar 2014, 4 commits
  5. 09 Mar 2014, 1 commit
  6. 04 Mar 2014, 1 commit
  7. 01 Mar 2014, 1 commit
  8. 28 Feb 2014, 1 commit
  9. 27 Feb 2014, 9 commits
    • cpuset: fix a race condition in __cpuset_node_allowed_softwall() · 99afb0fd
      Committed by Li Zefan
      It's not safe to access task's cpuset after releasing task_lock().
      Holding callback_mutex won't help.
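      A sketch of the safe pattern (abbreviated; the exact locking site is
      my reading of the fix, treat it as an assumption):

        int allowed;

        task_lock(current);
        cs = task_cs(current);                  /* the task's cpuset */
        allowed = node_isset(node, cs->mems_allowed);
        task_unlock(current);                   /* only now drop the lock */
        return allowed;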
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: fix a locking issue in cpuset_migrate_mm() · 47295830
      Committed by Li Zefan
      I can trigger a lockdep warning:
      
        # mount -t cgroup -o cpuset xxx /cgroup
        # mkdir /cgroup/cpuset
        # mkdir /cgroup/tmp
        # echo 0 > /cgroup/tmp/cpuset.cpus
        # echo 0 > /cgroup/tmp/cpuset.mems
        # echo 1 > /cgroup/tmp/cpuset.memory_migrate
        # echo $$ > /cgroup/tmp/tasks
  # echo 1 > /cgroup/tmp/cpuset.mems
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        3.14.0-rc1-0.1-default+ #32 Not tainted
        -------------------------------
        include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage!
        ...
          [<ffffffff81582174>] dump_stack+0x72/0x86
          [<ffffffff810b8f01>] lockdep_rcu_suspicious+0x101/0x140
          [<ffffffff81105ba1>] cpuset_migrate_mm+0xb1/0xe0
        ...
      
      We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now
      we hold cpuset_mutex, which causes task_css() to complain.
      
      This is not a false-positive but a real issue.
      
      Holding cpuset_mutex won't prevent a task from migrating to another
      cpuset, and it won't prevent the original task->cgroup from being
      destroyed during this change.
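      One plausible shape of the fix (a sketch, not the verbatim diff):
      task_css()-derived pointers are only stable inside an RCU read-side
      section when no stronger lock is held.

        rcu_read_lock();
        guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
        rcu_read_unlock();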
      
      Fixes: 5d21cc2d (cpuset: replace cgroup_mutex locking with cpuset internal locking)
      Cc: <stable@vger.kernel.org> # 3.9+
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • genirq: Include missing header file in irqdomain.c · 64be38ab
      Committed by Rashika Kheria
      Include the header file include/linux/of_irq.h in
      kernel/irq/irqdomain.c, because it contains the prototype of a
      function defined in kernel/irq/irqdomain.c.
      
      This eliminates the following warning in kernel/irq/irqdomain.c:
      kernel/irq/irqdomain.c:468:14: warning: no previous prototype for ‘irq_create_of_mapping’ [-Wmissing-prototypes]
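      The fix itself is the one-line include named in the message:

        /* kernel/irq/irqdomain.c */
        #include <linux/of_irq.h>   /* declares irq_create_of_mapping() */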
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Link: http://lkml.kernel.org/r/eb89aebea7ff1a46122918ac389ebecf8248be9a.1393493276.git.rashika.kheria@gmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • perf: Fix hotplug splat · e3703f8c
      Committed by Peter Zijlstra
      Drew Richardson reported that he could make the kernel go *boom* when hotplugging
      while having perf events active.
      
      It turned out that when you have a group event, the code in
      __perf_event_exit_context() fails to remove the group siblings from
      the context.
      
      We then proceed with destroying and freeing the event, and when you
      re-plug the CPU and try to add another event to it, things go *boom*
      because the dead entries are still there.
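      A sketch of the corrected exit path (per my reading of the fix;
      treat the details as assumptions):

        static void __perf_event_exit_context(void *__info)
        {
            struct perf_event_context *ctx = __info;
            struct perf_event *event;

            /* walk event_list, which contains group siblings as well as
             * leaders, so nothing is left dangling in the context */
            rcu_read_lock();
            list_for_each_entry_rcu(event, &ctx->event_list, event_entry)
                __perf_event_remove_from_context(event);
            rcu_read_unlock();
        }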
      Reported-by: Drew Richardson <drew.richardson@arm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/n/tip-k6v5wundvusvcseqj1si0oz0@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Prevent rt_time growth to infinity · faa59937
      Committed by Juri Lelli
      Kirill Tkhai noted:
      
        Since deadline tasks share rt bandwidth, we must care about
        bandwidth timer set. Otherwise rt_time may grow up to infinity
        in update_curr_dl(), if there are no other available RT tasks
        on top level bandwidth.
      
      RT tasks were in fact throttled right after they were enqueued, and
      never executed again (rt_time never again went below rt_runtime).

      Peter then proposed to accrue DL execution on rt_time only when the
      rt timer is active, and proposed a patch (this patch is a slight
      modification of it) to implement that behavior. While this solves
      Kirill's problem, it has a drawback.
      
      Indeed, Kirill noted again:
      
        It looks we may get into a situation, when all CPU time is shared
        between RT and DL tasks:
      
        rt_runtime = n
        rt_period  = 2n
      
        | RT working, DL sleeping  | DL working, RT sleeping      |
        -----------------------------------------------------------
        | (1)     duration = n     | (2)     duration = n         | (repeat)
        |--------------------------|------------------------------|
        | (rt_bw timer is running) | (rt_bw timer is not running) |
      
        No time for fair tasks at all.
      
      While this can happen during the first period, if the rq is always
      backlogged RT tasks won't get the opportunity to execute again:
      rt_time reached rt_runtime during (1); suppose RT is enqueued back
      after (2). It gets throttled since the rt timer didn't fire, and
      from now on every replenishment is eaten up by DL tasks, which
      accrue their execution on rt_time (while the rt timer is active,
      since we have an RT task waiting for replenishment). FAIR tasks are
      never touched after this first period. This is not ideal, and the
      situation is even worse!

      The above (the nice case) practically never happens in reality,
      where your rt timer is not aligned to task periods, tasks are in
      general not periodic, etc. Long story short, you always risk
      overloading your system.
      
      This patch is based on Peter's idea, but exploits an additional fact:
      if you don't have RT tasks enqueued, it makes little sense to continue
      incrementing rt_time once you reached the upper limit (DL tasks have their
      own mechanism for throttling).
      
      This cures both problems:
      
       - no matter how many DL instances in the past, you'll have an rt_time
         slightly above rt_runtime when an RT task is enqueued, and from that
         point on (after the first replenishment), the task will normally execute;
      
       - you can still eat up all bandwidth during the first period, but not
         anymore after that, remember that DL execution will increment rt_time
         till the upper limit is reached.
      
      The situation is still not perfect! But, we have a simple solution for now,
      that limits how much you can jeopardize your system, as we keep working
      towards the right answer: RT groups scheduled using deadline servers.
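      The accounting test introduced here looks roughly like this (names
      per kernel/sched of this era; a sketch, not the verbatim diff):

        /* charge DL runtime to rt_time only while the rt period timer is
         * active, or while rt_time is still below the upper limit */
        bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
        {
            struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);

            return hrtimer_active(&rt_b->rt_period_timer) ||
                   rt_rq->rt_time < rt_b->rt_runtime;
        }

        /* in update_curr_dl(): */
        if (sched_rt_bandwidth_account(rt_rq))
            rt_rq->rt_time += delta_exec;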
      Reported-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Switch CPU's presence test order · eec751ed
      Committed by Juri Lelli
      Commit 82b95800 ("sched/deadline: Test for CPU's presence explicitly")
      changed how we check whether a CPU returned by the cpudeadline
      machinery is valid. But we don't want to call cpu_present() if
      best_cpu is equal to -1, so switch the order of the tests inside
      WARN_ON().
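      Since && short-circuits left to right, putting the -1 test first
      keeps cpu_present() from ever seeing an invalid index; roughly:

        WARN_ON(best_cpu != -1 && !cpu_present(best_cpu));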
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: boris.ostrovsky@oracle.com
      Cc: konrad.wilk@oracle.com
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/1393238832-9100-1-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migration · 3908ac13
      Committed by Kirill Tkhai
      In the deadline class we do not have group scheduling.

      So, let's remove the unnecessary

      	X = X;

      assignments.
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Fix double normalization of vruntime · 791c9e02
      Committed by George McCollister
      dequeue_entity() is called when p->on_rq is set and it sets
      se->on_rq = 0, which appears to guarantee that the !se->on_rq
      condition is met. If the task has done
      set_current_state(TASK_INTERRUPTIBLE) without calling schedule(),
      the second condition will be met and vruntime will be incorrectly
      adjusted twice.
      
      In certain cases this can result in the task's vruntime never increasing
      past the vruntime of other tasks on the CFS' run queue, starving them of
      CPU time.
      
      This patch changes switched_from_fair() to use !p->on_rq instead of
      !se->on_rq.
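      A sketch of the corrected test (per the description above; the body
      is abbreviated and should be treated as my reading):

        static void switched_from_fair(struct rq *rq, struct task_struct *p)
        {
            struct sched_entity *se = &p->se;
            struct cfs_rq *cfs_rq = cfs_rq_of(se);

            /* p->on_rq, not se->on_rq: a task that only set
             * TASK_INTERRUPTIBLE without scheduling is still on_rq */
            if (!p->on_rq && p->state != TASK_RUNNING) {
                place_entity(cfs_rq, se, 0);
                se->vruntime -= cfs_rq->min_vruntime;
            }
        }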
      
      I'm able to cause a task with a priority of 120 to starve all other
      tasks with the same priority on an ARM platform running 3.2.51-rt72
      PREEMPT RT by writing one character at time to a serial tty (16550 UART)
      in a tight loop. I'm also able to verify making this change corrects the
      problem on that platform and kernel version.
      Signed-off-by: George McCollister <george.mccollister@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • genirq: Remove racy waitqueue_active check · c685689f
      Committed by Chuansheng Liu
      We hit one rare case below:

      T1 calls disable_irq() but hangs at synchronize_irq() forever;
      the corresponding irq thread is sleeping;
      and all CPUs are in the idle state.

      After analysis, we found one possible scenario that causes T1 to
      wait there forever:
      CPU0                                       CPU1
       synchronize_irq()
        wait_event()
          spin_lock()
                                                 atomic_dec_and_test(&threads_active)
            insert the __wait into queue
          spin_unlock()
                                                 if(waitqueue_active)
          atomic_read(&threads_active)
                                                   wake_up()
      
      Here, after CPU0 inserts __wait into the queue and before CPU1 tests
      whether the queue is empty, there is no barrier; the update may not
      be visible to CPU1 immediately, even though CPU0 has updated the
      queue list. The same applies to CPU0's atomic_read() of
      threads_active.

      So we'd need an smp_mb() before waitqueue_active(), but removing the
      waitqueue_active() check solves it as well and makes things simple
      and clear.
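      The fixed wake path then reads (a sketch of kernel/irq/manage.c
      after the change; treat the exact shape as my reading):

        static void wake_threads_waitq(struct irq_desc *desc)
        {
            /* no lockless waitqueue_active() peek: wake unconditionally
             * once the last threaded handler is done */
            if (atomic_dec_and_test(&desc->threads_active))
                wake_up(&desc->wait_for_threads);
        }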
      Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
      Cc: Xiaoming Wang <xiaoming.wang@intel.com>
      Link: http://lkml.kernel.org/r/1393212590-32543-1-git-send-email-chuansheng.liu@intel.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  10. 22 Feb 2014, 9 commits
  11. 21 Feb 2014, 1 commit
  12. 20 Feb 2014, 1 commit
    • sched_clock: Prevent callers from seeing half-updated data · 5ae8aabe
      Committed by Stephen Boyd
      The generic sched_clock registration function was previously
      lockless, because it was expected to be called only once. However,
      there are now systems that may register multiple sched_clock
      sources, and for them the lack of locking has caused problems:
      
      If two sched_clock sources are registered we may end up in a
      situation where a call to sched_clock() may be accessing the
      epoch cycle count for the old counter and the cycle count for the
      new counter. This can lead to confusing results where
      sched_clock() values jump and then are reset to 0 (due to the way
      the registration function forces the epoch_ns to be 0).
      
      Fix this by reorganizing the registration function to hold the
      seqlock for as short a time as possible while we update the
      clock_data structure for a new counter. We also put any
      accumulated time into epoch_ns instead of resetting the time to
      0 so that the clock doesn't reset after each successful
      registration.
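      Shape of the reworked registration (names follow
      kernel/time/sched_clock.c; a sketch, not the verbatim patch):

        /* compute everything up front... */
        new_epoch = read();
        cyc = read_sched_clock();
        ns = cd.epoch_ns +
             cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
                       cd.mult, cd.shift);   /* carry accumulated time */

        /* ...then publish under the seqcount, so sched_clock() readers
         * never mix old and new clock parameters */
        raw_write_seqcount_begin(&cd.seq);
        read_sched_clock = read;
        sched_clock_mask = new_mask;
        cd.mult      = new_mult;
        cd.shift     = new_shift;
        cd.epoch_cyc = new_epoch;
        cd.epoch_ns  = ns;
        raw_write_seqcount_end(&cd.seq);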
      
      [jstultz: Added extra context to the commit message]
      Reported-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Josh Cartwright <joshc@codeaurora.org>
      Link: http://lkml.kernel.org/r/1392662736-7803-2-git-send-email-john.stultz@linaro.org
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. 19 Feb 2014, 2 commits
    • cgroup: update cgroup_enable_task_cg_lists() to grab siglock · 532de3fc
      Committed by Tejun Heo
      Currently, there's nothing preventing cgroup_enable_task_cg_lists()
      from missing a set PF_EXITING flag and racing against cgroup_exit().
      Depending on the timing, cgroup_exit() may finish with the task
      still linked on css_set, leading to list corruption.  Fix it by
      grabbing siglock in cgroup_enable_task_cg_lists() so that PF_EXITING
      is guaranteed to be visible.
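      Sketch of the guarded linking (treat the exact locking site as my
      reading of the description):

        spin_lock_irq(&p->sighand->siglock);
        /* per the commit, holding siglock here guarantees that a set
         * PF_EXITING is visible before we link the task */
        if (!(p->flags & PF_EXITING) && list_empty(&p->cg_list))
            list_add(&p->cg_list, &task_css_set(p)->tasks);
        spin_unlock_irq(&p->sighand->siglock);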
      
      This whole on-demand cg_list optimization is extremely fragile and has
      ample possibility to lead to bugs which can cause things like
      once-a-year oops during boot.  I'm wondering whether the better
      approach would be just adding "cgroup_disable=all" handling which
      disables the whole cgroup rather than tempting fate with this
      on-demand craziness.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
    • workqueue: ensure @task is valid across kthread_stop() · 5bdfff96
      Committed by Lai Jiangshan
      When a kworker should die, the kworker is notified through the
      WORKER_DIE flag instead of kthread_should_stop().  This, IIRC, is
      primarily to keep the test synchronized inside the worker_pool
      lock.  WORKER_DIE is first set while holding pool->lock, then the
      lock is dropped and kthread_stop() is called.
      
      Unfortunately, this means that there's a slight chance that the target
      kworker may see WORKER_DIE before kthread_stop() finishes and exits
      and frees the target task before or during kthread_stop().
      
      Fix it by pinning the target task before setting WORKER_DIE and
      putting it after kthread_stop() is done.
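      Sketch of the fixed shutdown sequence (simplified from the actual
      function):

        struct task_struct *task = worker->task;

        get_task_struct(task);          /* pin: *task stays valid past exit */
        spin_lock_irq(&pool->lock);
        worker->flags |= WORKER_DIE;    /* kworker may now exit at any time */
        spin_unlock_irq(&pool->lock);

        kthread_stop(task);             /* safe: we still hold a reference */
        put_task_struct(task);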
      
      tj: Improved patch description and comment.  Moved pinning above
          WORKER_DIE to better signify what it's protecting.
      
      CC: stable@vger.kernel.org
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  14. 18 Feb 2014, 2 commits
    • inotify: Fix reporting of cookies for inotify events · 45a22f4c
      Committed by Jan Kara
      My rework of handling of notification events (namely commit 7053aee2
      "fsnotify: do not share events between notification groups") broke
      sending of cookies with inotify events. We didn't propagate the value
      passed to fsnotify() properly and passed 4 uninitialized bytes to
      userspace instead (so it is also an information leak). Sadly I didn't
      notice this during my testing because inotify cookies aren't used very
      much and LTP inotify tests ignore them.
      
      Fix the problem by passing the cookie value properly.
      
      Fixes: 7053aee2
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
    • printk: fix syslog() overflowing user buffer · e4178d80
      Committed by Linus Torvalds
      This is not a buffer overflow in the traditional sense: we don't
      overflow any *kernel* buffers, but we do mis-count the amount of data we
      copy back to user space for the SYSLOG_ACTION_READ_ALL case.
      
      In particular, if the user buffer is too small to hold everything, and
      *if* there is a continuation line at just the right place, we can end up
      giving the user more data than he asked for.
      
      The reason is that we first count up the number of bytes all the log
      records contain, then we walk the records again until we've skipped
      the records at the beginning that won't fit, and then we walk the
      rest of the records and copy them to the user space buffer.
      
      And in between that "skip the initial records that won't fit" and the
      "copy the records that *will* fit to user space", we reset the 'prev'
      variable that contained the record information for the last record not
      copied.  That meant that when we started copying to user space, we now
      had a different character count than what we had originally calculated
      in the first record walk-through.
      
      The fix is to simply not clear the 'prev' flags value (in both
      places where we had the same logic: syslog_print_all and
      kmsg_dump_get_buffer; the latter is used for pstore-like dumping).
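      In sketch form (simplified from syslog_print_all(); the exact
      arguments are abbreviated):

        /* second pass: skip records at the head that won't fit */
        while (len > size && seq < log_next_seq) {
            struct printk_log *msg = log_from_idx(idx);

            len -= msg_print_text(msg, prev, true, NULL, 0);
            prev = msg->flags;  /* keep; the bug was resetting prev to 0
                                 * before the copy pass */
            idx = log_next(idx);
            seq++;
        }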
      Reported-and-tested-by: Debabrata Banerjee <dbanerje@akamai.com>
      Acked-by: Kay Sievers <kay@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 14 Feb 2014, 1 commit
    • tick: Clear broadcast pending bit when switching to oneshot · dd5fd9b9
      Committed by Thomas Gleixner
      AMD systems which use the C1E workaround in the amd_e400_idle routine
      trigger the WARN_ON_ONCE in the broadcast code when onlining a CPU.
      
      The reason is that the idle routine of those AMD systems switches the
      cpu into forced broadcast mode early on before the newly brought up
      CPU can switch over to high resolution / NOHZ mode. The timer related
      CPU1 bringup looks like this:
      
        clockevent_register_device(local_apic);
        tick_setup(local_apic);
        ...
        idle()
      	tick_broadcast_on_off(FORCE);
      	tick_broadcast_oneshot_control(ENTER)
      	  cpumask_set(cpu, broadcast_oneshot_mask);
      	halt();
      
      Now the broadcast interrupt on CPU0 sets CPU1 in the
      broadcast_pending_mask and wakes CPU1. So CPU1 continues:
      
      	local_apic_timer_interrupt()
      	   tick_handle_periodic();
      	   softirq()
      	     tick_init_highres();
      	       cpumask_clr(cpu, broadcast_oneshot_mask);
      	
      	tick_broadcast_oneshot_control(ENTER)
      	   WARN_ON(cpumask_test(cpu, broadcast_pending_mask));
      
      So while we remove CPU1 from the broadcast_oneshot_mask when we switch
      over to highres mode, we do not clear the pending bit, which then
      triggers the warning when we go back to idle.
      
      The reason why this is only visible on C1E affected AMD systems is
      that the other machines enter the deep sleep states via
      acpi_idle/intel_idle and exit the broadcast mode before executing the
      remote triggered local_apic_timer_interrupt. So the pending bit is
      already cleared when the switch over to highres mode is clearing the
      oneshot mask.
      
      The solution is simple: Clear the pending bit together with the mask
      bit when we switch over to highres mode.
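      Roughly (mask names per kernel/time/tick-broadcast.c; the exact call
      site is abbreviated and should be treated as my reading):

        /* leaving broadcast mode on the switch to oneshot/highres */
        cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
        /* also drop any broadcast wakeup still marked pending */
        cpumask_clear_cpu(cpu, tick_broadcast_pending_mask);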
      
      Stanislaw came up independently with the same patch by enforcing the
      C1E workaround and debugging the fallout. I picked mine, because mine
      has a changelog :)
      Reported-by: poma <pomidorabelisima@gmail.com>
      Debugged-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Olaf Hering <olaf@aepfle.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Justin M. Forbes <jforbes@redhat.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1402111434180.21991@ionos.tec.linutronix.de
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  16. 13 Feb 2014, 1 commit
    • Revert "cgroup: use an ordered workqueue for cgroup destruction" · 1a11533f
      Committed by Tejun Heo
      This reverts commit ab3f5faa.
      Explanation from Hugh:
      
        It's because more thorough testing, by others here, found that it
        wasn't always solving the problem: so I asked Tejun privately to
        hold off from sending it in, until we'd worked out why not.
      
        Most of our testing being on a v3.11-based kernel, it was perfectly
        possible that the problem was merely our own e.g. missing Tejun's
        8a2b7538 ("workqueue: fix ordered workqueues in NUMA setups").
      
        But that turned out not to be enough to fix it either. Then Filipe
        pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
        before we ever get to put the offline on to the workqueue: by the
        time we get to the workqueue, the ordering has already been lost.
      
        So, thanks for the Acks, but I'm afraid that this ordered workqueue
        solution is just not good enough: we should simply forget that patch
        and provide a different answer.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
  17. 12 Feb 2014, 1 commit
    • ring-buffer: Fix first commit on sub-buffer having non-zero delta · d651aa1d
      Committed by Steven Rostedt (Red Hat)
      Each sub-buffer (buffer page) has a full 64 bit timestamp. The events on
      that page use a 27 bit delta against that timestamp in order to save on
      bits written to the ring buffer. If the time between events is larger than
      what the 27 bits can hold, a "time extend" event is added to hold the
      entire 64 bit timestamp again and the events after that hold a delta from
      that timestamp.
      
      As a "time extend" is always paired with an event, it is logical to just
      allocate the event with the time extend, to make things a bit more efficient.
      
      Unfortunately, when the pairing code was written, it removed the "delta = 0"
      from the first commit on a page, causing the events on the page to be
      slightly skewed.
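      The fix restores that rule; in sketch form (the exact location in
      the reserve path is my reading):

        /* tail == 0: this event starts the sub-buffer, so it shares the
         * page's full 64 bit timestamp and must carry a zero delta */
        if (!tail)
            delta = 0;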
      
      Fixes: 69d1b839 "ring-buffer: Bind time extend and data events together"
      Cc: stable@vger.kernel.org # 2.6.37+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  18. 11 Feb 2014, 1 commit
    • cgroup: protect modifications to cgroup_idr with cgroup_mutex · 0ab02ca8
      Committed by Li Zefan
      Setup cgroupfs like this:
        # mount -t cgroup -o cpuacct xxx /cgroup
        # mkdir /cgroup/sub1
        # mkdir /cgroup/sub2
      
      Then run these two commands:
  # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /cgroup/sub1/tmp; } &
  # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /cgroup/sub2/tmp; } &
      
      After seconds you may see this warning:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
      idr_remove called for id=6 which is not allocated.
      ...
      Call Trace:
       [<ffffffff8156063c>] dump_stack+0x7a/0x96
       [<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0
       [<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50
       [<ffffffff81300aa7>] sub_remove+0x87/0x1b0
       [<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0
       [<ffffffff81300bf5>] idr_remove+0x25/0xd0
       [<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0
       [<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0
       [<ffffffff8107cdbc>] process_one_work+0x26c/0x550
       [<ffffffff8107eefe>] worker_thread+0x12e/0x3b0
       [<ffffffff81085f96>] kthread+0xe6/0xf0
       [<ffffffff81570bac>] ret_from_fork+0x7c/0xb0
      ---[ end trace 2d1577ec10cf80d0 ]---
      
      It's because allocating/removing cgroup ID is not properly synchronized.
      
      The bug was introduced when we converted cgroup_ida to cgroup_idr.
      While synchronization is handled inside ida_simple_{get,remove}(),
      users are responsible for synchronizing concurrent calls to
      idr_{alloc,remove}().
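      The rule the fix enforces, in sketch form (call sites abbreviated):

        /* idr_alloc()/idr_remove() don't serialize their callers, so
         * every cgroup_idr update must run under cgroup_mutex */
        mutex_lock(&cgroup_mutex);
        id = idr_alloc(&root->cgroup_idr, cgrp, 1, 0, GFP_KERNEL);
        mutex_unlock(&cgroup_mutex);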
      
      tj: Refreshed on top of b58c8998 ("cgroup: fix error return from
      cgroup_create()").
      
      Fixes: 4e96ee8e ("cgroup: convert cgroup_ida to cgroup_idr")
      Cc: <stable@vger.kernel.org> #3.12+
      Reported-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>