  1. 13 Jan, 2014 4 commits
    • futexes: Avoid taking the hb->lock if there's nothing to wake up · b0c29f79
      Davidlohr Bueso committed
      In futex_wake() there is clearly no point in taking the hb->lock
      if we know beforehand that there are no tasks to be woken. While
      the hash bucket's plist head is a cheap way of knowing this, we
      cannot rely 100% on it as there is a racy window between the
      futex_wait call and when the task is actually added to the
      plist. To this end, we couple it with the spinlock check as
      tasks trying to enter the critical region are most likely
      potential waiters that will be added to the plist, thus
      preventing tasks sleeping forever if wakers don't acknowledge
      all possible waiters.
      
      Furthermore, the futex ordering guarantees are preserved,
      ensuring that waiters either observe the changed user space
      value before blocking or are woken by a concurrent waker. For
      wakers, this is done by relying on the barriers in
      get_futex_key_refs() -- for archs that do not have implicit mb
      in atomic_inc(), we explicitly add them through a new
      futex_get_mm function. For waiters we rely on the fact that
      spin_lock calls already update the head counter, so spinners
      are visible even if the lock hasn't been acquired yet.
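The decision described above can be sketched in user-space C. This is a simplified model rather than the kernel code: `struct bucket_model`, `lock_busy`, `nr_queued` and `waiters_pending()` are hypothetical names standing in for the hash bucket, the state of its spinlock, and the plist of queued waiters.

```c
#include <stdbool.h>

/* Hypothetical user-space model of one futex hash bucket. */
struct bucket_model {
	bool lock_busy;         /* a task holds, or is spinning on, the bucket lock */
	unsigned int nr_queued; /* tasks already queued on the wait list */
};

/*
 * Returns true if a waker has to take the bucket lock: either a task is
 * already queued, or a task is inside (or entering) the locked region and
 * may be about to queue itself. Only when both checks are negative can the
 * waker skip the lock.
 */
bool waiters_pending(const struct bucket_model *hb)
{
	return hb->lock_busy || hb->nr_queued > 0;
}
```

Checking the list alone would miss a waiter that has started taking the lock but is not yet queued, which is the racy window the message describes.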
      
      For more details please refer to the updated comments in the
      code and related discussion:
      
        https://lkml.org/lkml/2013/11/26/556
      
      Special thanks to tglx for careful review and feedback.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • futexes: Document multiprocessor ordering guarantees · 99b60ce6
      Thomas Gleixner committed
      That's essential, if you want to hack on futexes.
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • futexes: Increase hash table size for better performance · a52b89eb
      Davidlohr Bueso committed
      Currently, the futex global hash table suffers from its fixed,
      smallish (for today's standards) size of 256 entries, as well as
      its lack of NUMA awareness. Large systems using many futexes
      can be prone to a high number of collisions, where these futexes
      hash to the same bucket and lead to extra contention on the same
      hb->lock. Furthermore, cacheline bouncing is a reality when we
      have multiple hb->locks residing on the same cacheline and
      different futexes hash to adjacent buckets.
      
      This patch keeps the current static size of 16 entries for small
      systems, or otherwise, 256 * ncpus (or larger as we need to
      round the number to a power of 2). Note that this number of CPUs
      accounts for all CPUs that can ever be available in the system,
      taking into consideration things like hotplugging. While we do
      impose extra overhead at bootup by making the hash table larger,
      this is a one time thing, and does not shadow the benefits of
      this patch.
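As a rough illustration of the sizing rule above, the following sketch computes the table size from the number of possible CPUs. The function names `roundup_pow2()` and `futex_table_size()` and the `small_system` flag are hypothetical, and rounding up (rather than down) is an assumption based on the "or larger" wording.

```c
#include <stdbool.h>

/* Round n up to the next power of two; returns 1 for n <= 1. */
unsigned long roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Table size as described in the message: a fixed 16 entries on small
 * systems, otherwise 256 entries per possible CPU, rounded up to a
 * power of two.
 */
unsigned long futex_table_size(unsigned int possible_cpus, bool small_system)
{
	if (small_system)
		return 16;
	return roundup_pow2(256UL * possible_cpus);
}
```

For the 80-core machine used in the benchmark, 256 * 80 = 20480, which rounds up to 32768 buckets under these assumptions.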
      
      Furthermore, as suggested by tglx, by cache aligning the hash
      buckets we can avoid access across cacheline boundaries and also
      avoid massive cache line bouncing if multiple cpus are hammering
      away at different hash buckets which happen to reside in the
      same cache line.
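The effect of cache-aligning the buckets can be illustrated with a small C11 sketch. The 64-byte line size, the `struct bucket_aligned` layout and `bucket_offset()` are assumptions made for illustration only; the real bucket holds a spinlock and a plist head.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* assumed line size; real hardware varies */

/* Hypothetical bucket: aligning the first member aligns the whole struct. */
struct bucket_aligned {
	_Alignas(CACHE_LINE_BYTES) int lock; /* stand-in for the bucket lock */
	void *chain;                         /* stand-in for the wait list */
};

/* Byte offset of bucket i within an array of buckets. */
size_t bucket_offset(size_t i)
{
	return i * sizeof(struct bucket_aligned);
}
```

Because the structure is padded to a multiple of the line size, adjacent array elements never share a cache line, so CPUs working on neighbouring buckets do not bounce the same line between them.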
      
      Also, similar to other core kernel components (pid, dcache,
      tcp), by using alloc_large_system_hash() we benefit from its
      NUMA awareness and thus the table is distributed among the nodes
      instead of in a single one.
      
      For a custom microbenchmark that pounds on the uaddr hashing --
      making the wait path fail at futex_wait_setup() returning
      -EWOULDBLOCK for large numbers of futexes, we can see the
      following benefits on an 80-core, 8-socket 1TB server:
      
       +---------+--------------------+------------------------+-----------------------+-------------------------------+
       | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
       +---------+--------------------+------------------------+-----------------------+-------------------------------+
       |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
       |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
       |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
       |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
       |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
       |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
       +---------+--------------------+------------------------+-----------------------+-------------------------------+
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Waiman Long <Waiman.Long@hp.com>
      Reviewed-and-tested-by: Jason Low <jason.low2@hp.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • futexes: Clean up various details · 0d00c7b2
      Jason Low committed
      - Remove unnecessary head variables.
      - Delete unused parameter in queue_unlock().
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 22 Dec, 2013 1 commit
    • PM / sleep: Fix memory leak in pm_vt_switch_unregister(). · c6068504
      Masami Ichikawa committed
      kmemleak reported a memory leak as below.
      
      unreferenced object 0xffff880118f14700 (size 32):
        comm "swapper/0", pid 1, jiffies 4294877401 (age 123.283s)
        hex dump (first 32 bytes):
          00 01 10 00 00 00 ad de 00 02 20 00 00 00 ad de  .......... .....
          00 d4 d2 18 01 88 ff ff 01 00 00 00 00 04 00 00  ................
        backtrace:
          [<ffffffff814edb1e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811889dc>] kmem_cache_alloc_trace+0x1ec/0x260
          [<ffffffff810aba66>] pm_vt_switch_required+0x76/0xb0
          [<ffffffff812f39f5>] register_framebuffer+0x195/0x320
          [<ffffffff8130af18>] efifb_probe+0x718/0x780
          [<ffffffff81391495>] platform_drv_probe+0x45/0xb0
          [<ffffffff8138f407>] driver_probe_device+0x87/0x3a0
          [<ffffffff8138f7f3>] __driver_attach+0x93/0xa0
          [<ffffffff8138d413>] bus_for_each_dev+0x63/0xa0
          [<ffffffff8138ee5e>] driver_attach+0x1e/0x20
          [<ffffffff8138ea40>] bus_add_driver+0x180/0x250
          [<ffffffff8138fe74>] driver_register+0x64/0xf0
          [<ffffffff813913ba>] __platform_driver_register+0x4a/0x50
          [<ffffffff8191e028>] efifb_driver_init+0x12/0x14
          [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0
          [<ffffffff818e40e0>] kernel_init_freeable+0x17b/0x201
      
      In pm_vt_switch_required(), the "entry" variable is allocated via kmalloc().
      So pm_vt_switch_unregister() needs to call kfree() when the object
      is deleted from the list.
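The pattern behind this fix can be shown with a user-space analogue, using malloc()/free() in place of kmalloc()/kfree() and a hand-rolled singly linked list in place of the kernel list. All names here (`struct vt_entry`, `vt_list`, `vt_register()`, `vt_unregister()`) are hypothetical and are not the kernel's.

```c
#include <stdlib.h>

/* Hypothetical list entry, allocated on registration. */
struct vt_entry {
	struct vt_entry *next;
	const void *dev;
};

struct vt_entry *vt_list; /* list head, NULL when empty */

/* Allocate an entry for dev and push it on the list; 0 on success. */
int vt_register(const void *dev)
{
	struct vt_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->dev = dev;
	e->next = vt_list;
	vt_list = e;
	return 0;
}

/* Unlink the entry for dev and free it; returns 1 if one was found. */
int vt_unregister(const void *dev)
{
	struct vt_entry **pp = &vt_list;

	while (*pp) {
		struct vt_entry *e = *pp;

		if (e->dev == dev) {
			*pp = e->next; /* remove from the list ... */
			free(e);       /* ... and release the allocation */
			return 1;
		}
		pp = &e->next;
	}
	return 0;
}
```

Unlinking without the free() call would reproduce the kind of leak reported by kmemleak: the object becomes unreachable but stays allocated.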
      Signed-off-by: Masami Ichikawa <masami256@gmail.com>
      Reviewed-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  3. 21 Dec, 2013 1 commit
  4. 20 Dec, 2013 1 commit
    • libata, freezer: avoid block device removal while system is frozen · 85fbd722
      Tejun Heo committed
      Freezable kthreads and workqueues are fundamentally problematic in
      that they effectively introduce a big kernel lock widely used in the
      kernel and have already been the culprit of several deadlock
      scenarios.  This is the latest occurrence.
      
      During resume, libata rescans all the ports and revalidates all
      pre-existing devices.  If it determines that a device has gone
      missing, the device is removed from the system which involves
      invalidating block device and flushing bdi while holding driver core
      layer locks.  Unfortunately, this can race with the rest of device
      resume.  Because freezable kthreads and workqueues are thawed after
      device resume is complete and block device removal depends on
      freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
      progress, this can lead to deadlock - block device removal can't
      proceed because kthreads are frozen and kthreads can't be thawed
      because device resume is blocked behind block device removal.
      
      839a8e86 ("writeback: replace custom worker pool implementation
      with unbound workqueue") made this particular deadlock scenario more
      visible but the underlying problem has always been there - the
      original forker task and jbd2 are freezable too.  In fact, this is
      highly likely just one of many possible deadlock scenarios given that
      freezer behaves as a big kernel lock and we don't have any debug
      mechanism around it.
      
      I believe the right thing to do is getting rid of freezable kthreads
      and workqueues.  This is something fundamentally broken.  For now,
      implement a funny workaround in libata - just avoid doing block device
      hot[un]plug while the system is frozen.  Kernel engineering at its
      finest.  :(
      
      v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
          as a module.
      
      v3: Comment updated and polling interval changed to 10ms as suggested
          by Rafael.
      
      v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
          defined when FREEZER is not configured thus breaking build.
          Reported by kbuild test robot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
      Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
      Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      Cc: kbuild test robot <fengguang.wu@intel.com>
  5. 19 Dec, 2013 3 commits
  6. 17 Dec, 2013 5 commits
    • mutexes: Give more informative mutex warning in the !lock->owner case · 91f30a17
      Chuansheng Liu committed
      When mutex debugging is enabled and an imbalanced mutex_unlock()
      is called, we get the following, slightly confusing warning:
      
        [  364.208284] DEBUG_LOCKS_WARN_ON(lock->owner != current)
      
      But in that case the warning is due to an imbalanced mutex_unlock() call,
      and the lock->owner is NULL - so the message is misleading.
      
      So improve the message by testing for this case specifically:
      
         DEBUG_LOCKS_WARN_ON(!lock->owner)
      Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1386136693.3650.48.camel@cliu38-desktop-build
      [ Improved the changelog, changed the patch to use !lock->owner consistently. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities · 757dfcaa
      Kirill Tkhai committed
      This patch touches the RT group scheduling case.
      
      Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's
      priority, while the rt_rq passed to them may not be the top-level rt_rq.
      This is wrong, because changing the priority on a child level does not
      guarantee that the priority is the highest across the whole rq. So, this
      leak makes RT balancing unusable.
      
      A short example: the task with the highest priority among all of the rq's
      RT tasks (no other task has the same priority) wakes up on a
      throttled rt_rq.  The rq's cpupri is set to the equivalent of the task's
      priority, but the real rq->rt.highest_prio.curr is lower.
      
      The patch below fixes the problem.
      Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      CC: Steven Rostedt <rostedt@goodmis.org>
      CC: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Assign correct scheduling domain to 'sd_llc' · 5d4cf996
      Mel Gorman committed
      Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
      dereference on sd_busy but the fix also altered what scheduling domain it
      used for the 'sd_llc' percpu variable.
      
      One impact of this is that a task selecting a runqueue may consider
      idle CPUs that are not cache siblings as candidates for running.
      Tasks are then running on CPUs that are not cache hot.
      
      This was found through bisection where ebizzy threads were not seeing equal
      performance and it looked like a scheduling fairness issue. This patch
      mitigates but does not completely fix the problem on all machines tested,
      implying there may be an additional bug or a common root cause. Here is
      the average range of performance seen by individual ebizzy threads. It
      was tested on top of candidate patches related to x86 TLB range flushing.
      
      	4-core machine
      			    3.13.0-rc3            3.13.0-rc3
      			       vanilla            fixsd-v3r3
      	Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
      	Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
      	Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
      	Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
      	Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
      	Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
      	Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)
      
      	8-core machine
      	Mean   1         0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2         0.40 (  0.00%)        0.21 ( 47.50%)
      	Mean   3        23.73 (  0.00%)        0.89 ( 96.25%)
      	Mean   4        12.79 (  0.00%)        1.04 ( 91.87%)
      	Mean   5        13.08 (  0.00%)        2.42 ( 81.50%)
      	Mean   6        23.21 (  0.00%)       69.46 (-199.27%)
      	Mean   7        15.85 (  0.00%)      101.72 (-541.77%)
      	Mean   8       109.37 (  0.00%)       19.13 ( 82.51%)
      	Mean   12      124.84 (  0.00%)       28.62 ( 77.07%)
      	Mean   16      113.50 (  0.00%)       24.16 ( 78.71%)
      
      It's eliminated for one machine and reduced for another.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: H Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Disable all pmus on unthrottling and rescheduling · 44377277
      Alexander Shishkin committed
      Currently, only one PMU in a context gets disabled during unthrottling
      and event_sched_{out,in}(), however, events in one context may belong to
      different pmus, which results in PMUs being reprogrammed while they are
      still enabled.
      
      This means that mixed PMU use [which is rare in itself] resulted in
      potentially completely unreliable results: corrupted events, bogus
      results, etc.
      
      This patch temporarily disables PMUs that correspond to
      each event in the context while these events are being modified.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Link: http://lkml.kernel.org/r/1387196256-8030-1-git-send-email-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cgroup: don't recycle cgroup id until all csses' have been destroyed · c1a71504
      Li Zefan committed
      Hugh reported this bug:
      
      > CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
      >
      > mkdir -p /tmp/tmpfs /tmp/memcg
      > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
      > mount -t cgroup -o memory memcg /tmp/memcg
      > mkdir /tmp/memcg/old
      > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
      > echo $$ >/tmp/memcg/old/tasks
      > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
      > echo $$ >/tmp/memcg/tasks
      > rmdir /tmp/memcg/old
      > sleep 1	# let rmdir work complete
      > mkdir /tmp/memcg/new
      > umount /tmp/tmpfs
      > dmesg | grep WARNING
      > rmdir /tmp/memcg/new
      > umount /tmp/memcg
      >
      > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
      >                            res_counter_uncharge_locked+0x1f/0x2f()
      >
      > Breakage comes from 34c00c31 ("memcg: convert to use cgroup id").
      >
      > The lifetime of a cgroup id is different from the lifetime of the
      > css id it replaced: memsw's css_get()s do nothing to hold on to the
      > old cgroup id, it soon gets recycled to a new cgroup, which then
      > mysteriously inherits the old's swap, without any charge for it.
      
      Instead of removing cgroup id right after all the csses have been
      offlined, we should do that after csses have been destroyed.
      
      To make sure an invalid css pointer won't be returned after the css
      is destroyed, make sure css_from_id() returns NULL in this case.
      
      tj: Updated comment to note planned changes for cgrp->id.
      Reported-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  7. 16 Dec, 2013 2 commits
    • ftrace: Initialize the ftrace profiler for each possible cpu · c4602c1c
      Miao Xie committed
      Ftrace currently initializes only the online CPUs. This implementation has
      two problems:
      - If we online a CPU after we enable the function profile, and then run the
        test, we will lose the trace information on that CPU.
        Steps to reproduce:
        # echo 0 > /sys/devices/system/cpu/cpu1/online
        # cd <debugfs>/tracing/
        # echo <some function name> >> set_ftrace_filter
        # echo 1 > function_profile_enabled
        # echo 1 > /sys/devices/system/cpu/cpu1/online
        # run test
      - If we offline a CPU before we enable the function profile, we will not clear
        the trace information when we enable the function profile. This can
        confuse users.
        Steps to reproduce:
        # cd <debugfs>/tracing/
        # echo <some function name> >> set_ftrace_filter
        # echo 1 > function_profile_enabled
        # run test
        # cat trace_stat/function*
        # echo 0 > /sys/devices/system/cpu/cpu1/online
        # echo 0 > function_profile_enabled
        # echo 1 > function_profile_enabled
        # cat trace_stat/function*
        # run test
        # cat trace_stat/function*
      
      So it is better that we initialize the ftrace profiler for each possible cpu
      every time we enable the function profile instead of just the online ones.
      
      Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com
      
      Cc: stable@vger.kernel.org # 2.6.31+
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • rcu: Apply smp_mb__after_unlock_lock() to preserve grace periods · 6303b9c8
      Paul E. McKenney committed
      RCU must ensure that there is the equivalent of a full memory
      barrier between any memory access preceding a grace period and any
      memory access following that same grace period, regardless of
      which CPU(s) happen to execute the two memory accesses.
      Therefore, downgrading UNLOCK+LOCK to no longer imply a full
      memory barrier requires some adjustments to RCU.
      
      This commit therefore adds smp_mb__after_unlock_lock()
      invocations as needed after the RCU lock acquisitions that need
      to be part of a full-memory-barrier UNLOCK+LOCK.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1386799151-2219-7-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 13 Dec, 2013 5 commits
    • KEYS: fix uninitialized persistent_keyring_register_sem · 6bd364d8
      Xiao Guangrong committed
      We run into this bug:
      [ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000
      [ 2736.063293] Faulting instruction address: 0xc00000000037efb0
      [ 2736.063300] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 2736.063303] SMP NR_CPUS=2048 NUMA pSeries
      [ 2736.063310] Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT iptable_nat nf_nat_ipv4 iptable_mangle iptable_security iptable_raw ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ebtable_filter ebtables ip6table_filter iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables ibmveth pseries_rng nx_crypto nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common ibmvfc scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_multipath dm_mod
      [ 2736.063383] CPU: 1 PID: 7128 Comm: ssh Not tainted 3.10.0-48.el7.ppc64 #1
      [ 2736.063389] task: c000000131930120 ti: c0000001319a0000 task.ti: c0000001319a0000
      [ 2736.063394] NIP: c00000000037efb0 LR: c0000000006c40f8 CTR: 0000000000000000
      [ 2736.063399] REGS: c0000001319a3870 TRAP: 0300   Not tainted  (3.10.0-48.el7.ppc64)
      [ 2736.063403] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 28824242  XER: 20000000
      [ 2736.063415] SOFTE: 0
      [ 2736.063418] CFAR: c00000000000908c
      [ 2736.063421] DAR: 0000000000000000, DSISR: 40000000
      [ 2736.063425]
      GPR00: c0000000006c40f8 c0000001319a3af0 c000000001074788 c0000001319a3bf0
      GPR04: 0000000000000000 0000000000000000 0000000000000020 000000000000000a
      GPR08: fffffffe00000002 00000000ffff0000 0000000080000001 c000000000924888
      GPR12: 0000000028824248 c000000007e00400 00001fffffa0f998 0000000000000000
      GPR16: 0000000000000022 00001fffffa0f998 0000010022e92470 0000000000000000
      GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR24: 0000000000000000 c000000000f4a828 00003ffffe527108 0000000000000000
      GPR28: c000000000f4a730 c000000000f4a828 0000000000000000 c0000001319a3bf0
      [ 2736.063498] NIP [c00000000037efb0] .__list_add+0x30/0x110
      [ 2736.063504] LR [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
      [ 2736.063508] PACATMSCRATCH [800000000280f032]
      [ 2736.063511] Call Trace:
      [ 2736.063516] [c0000001319a3af0] [c0000001319a3b80] 0xc0000001319a3b80 (unreliable)
      [ 2736.063523] [c0000001319a3b80] [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
      [ 2736.063530] [c0000001319a3c50] [c0000000006c1bb0] .down_write+0x70/0x78
      [ 2736.063536] [c0000001319a3cd0] [c0000000002e5ffc] .keyctl_get_persistent+0x20c/0x320
      [ 2736.063542] [c0000001319a3dc0] [c0000000002e2388] .SyS_keyctl+0x238/0x260
      [ 2736.063548] [c0000001319a3e30] [c000000000009e7c] syscall_exit+0x0/0x7c
      [ 2736.063553] Instruction dump:
      [ 2736.063556] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbd2b78 7c9e2378 7c7f1b78 f8010010
      [ 2736.063566] f821ff71 e8a50008 7fa52040 40de00c0 <e8be0000> 7fbd2840 40de0094 7fbff040
      [ 2736.063579] ---[ end trace 2708241785538296 ]---
      
      It's caused by uninitialized persistent_keyring_register_sem.
      
      The bug was introduced by commit f36f8c75, which contains two typos:
      CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS and
      krb_cache_register_sem should be persistent_keyring_register_sem.
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
    • KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y · f46a3cbb
      Kirill Tkhai committed
      Always remove generated SYSTEM_TRUSTED_KEYRING files while doing make mrproper.
      Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: David Howells <dhowells@redhat.com>
    • X.509: Fix certificate gathering · d7ec435f
      David Howells committed
      Fix the gathering of certificates from both the source tree and the build tree
      to correctly calculate the pathnames of all the certificates.
      
      The problem was that if the default generated cert, signing_key.x509, didn't
      exist then it would not have a path attached and if it did, it would have a
      path attached.
      
      This means that the contents of kernel/.x509.list would change between the
      first compilation in a directory and the second.  After the second it would
      remain stable because the signing_key.x509 file exists.
      
      The consequence was that the kernel would get relinked unconditionally on the
      second recompilation.  The second recompilation would also show something like
      this:
      
         X.509 certificate list changed
           CERTS   kernel/x509_certificate_list
           - Including cert /home/torvalds/v2.6/linux/signing_key.x509
           AS      kernel/system_certificates.o
           LD      kernel/built-in.o
      
      which is why the relink would happen.
      
      
      Unfortunately, it isn't a simple matter of just sticking a path on the front
      of the filename of the certificate in the build directory as make can't then
      work out how to build it.
      
      So the path has to be prepended to the name for sorting and duplicate
      elimination and then removed for the make rule if it is in the build tree.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
    • futex: move user address verification up to common code · 5cdec2d8
      Linus Torvalds committed
      When debugging the read-only hugepage case, I was confused by the fact
      that get_futex_key() did an access_ok() only for the non-shared futex
      case, since the user address checking really isn't in any way specific
      to the private key handling.
      
      Now, it turns out that the shared key handling does effectively do the
      equivalent checks inside get_user_pages_fast() (it doesn't actually
      check the address range on x86, but does check the page protections for
      being a user page).  So it wasn't actually a bug, but the fact that we
      treat the address differently for private and shared futexes threw me
      for a loop.
      
      Just move the check up, so that it gets done for both cases.  Also, use
      the 'rw' parameter for the type, even if it doesn't actually matter any
      more (it's a historical artifact of the old racy i386 "page faults from
      kernel space don't check write protections").
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • futex: fix handling of read-only-mapped hugepages · f12d5bfc
      Linus Torvalds committed
      The hugepage code had the exact same bug that regular pages had in
      commit 7485d0d3 ("futexes: Remove rw parameter from
      get_futex_key()").
      
      The regular page case was fixed by commit 9ea71503 ("futex: Fix
      regression with read only mappings"), but the transparent hugepage
      case (added in a5b338f2: "thp: update futex compound knowledge")
      remained broken.
      
      Found by Dave Jones and his trinity tool.
      Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
      Cc: stable@kernel.org # v2.6.38+
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f12d5bfc
  9. 11 Dec, 2013 4 commits
    • P
      sched/fair: Rework sched_fair time accounting · 9dbdb155
      Peter Zijlstra committed
      Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
      results in him occasionally seeing time going backwards - which
      crashes the scheduler ...
      
      Most of our time accounting can actually handle that except the most
      common one; the tick time update of sched_fair.
      
      There is a further problem with that code; previously we assumed that
      because we get a tick every TICK_NSEC our time delta could never
      exceed 32bits and math was simpler.
      
      However, ever since Frederic managed to get NO_HZ_FULL merged, this is
      no longer the case, since a task can now run for a long time indeed
      without getting a tick. It only takes about 4.3 seconds (2^32 ns) to
      overflow our u32 in nanoseconds.
      
      This means we not only need to better deal with time going backwards;
      but also means we need to be able to deal with large deltas.
      
      This patch reworks the entire code and uses mul_u64_u32_shr() as
      proposed by Andy a long while ago.
      
      We express our virtual time scale factor in a u32 multiplier and shift
      right and the 32bit mul_u64_u32_shr() implementation reduces to a
      single 32x32->64 multiply if the time delta is still short (common
      case).
      
      For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.
      Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: fweisbec@gmail.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9dbdb155
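      The multiply-and-shift scheme described in this commit message can be
      sketched in userspace C as follows. This is an illustration only, not
      the kernel's code: mul_u64_u32_shr_sketch() and ns_fits_u32() are
      hypothetical names, and the sketch assumes shift <= 32 and a result
      that fits in 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a nanosecond delta still fits in a u32 (2^32 ns is about 4.29 s). */
static bool ns_fits_u32(uint64_t delta_ns)
{
	return delta_ns <= UINT32_MAX;
}

/*
 * Hypothetical sketch: compute (a * mul) >> shift using only 32x32->64
 * multiplies. Assumes shift <= 32 and that the result fits in 64 bits.
 */
static uint64_t mul_u64_u32_shr_sketch(uint64_t a, uint32_t mul,
				       unsigned int shift)
{
	uint32_t al = (uint32_t)a;         /* low 32 bits of the delta */
	uint32_t ah = (uint32_t)(a >> 32); /* high 32 bits, zero for short deltas */
	uint64_t ret;

	assert(shift <= 32);

	ret = ((uint64_t)al * mul) >> shift;
	if (ah) /* only long deltas pay for the second multiply */
		ret += ((uint64_t)ah * mul) << (32 - shift);
	return ret;
}
```

      For a short delta the high half is zero, so only one multiply runs,
      which matches the "common case" the message describes.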
    • P
      sched: Initialize power_orig for overlapping groups · 8e8339a3
      Peter Zijlstra committed
      Yinghai reported that he saw a /0 in sg_capacity on his EX parts.
      Make sure to always initialize power_orig now that we actually use it.
      
      Ideally build_sched_domains() -> init_sched_groups_power() would also
      initialize this; but for some yet unexplained reason some setups seem
      to miss updates there.
      Reported-by: Yinghai Lu <yinghai@kernel.org>
      Tested-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8e8339a3
    • H
      KEYS: correct alignment of system_certificate_list content in assembly file · 62226983
      Hendrik Brueckner committed
      Apart from data-type specific alignment constraints, there are also
      architecture-specific alignment requirements.
      For example, on s390 symbols must be on even addresses implying a 2-byte
      alignment.  If the system_certificate_list_end symbol is on an odd address
      and if this address is loaded, the least-significant bit is ignored.  As a
      result, the load_system_certificate_list() fails to load the certificates
      because of a wrong certificate length calculation.
      
      To be safe, align system_certificate_list on an 8-byte boundary.  Also improve
      the length calculation of the system_certificate_list content.  Introduce a
      system_certificate_list_size (8-byte aligned because of unsigned long) variable
      that stores the length.  Let the linker calculate this size by introducing
      a start and end label for the certificate content.
      Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      62226983
    • R
      Ignore generated file kernel/x509_certificate_list · 7cfe5b33
      Rusty Russell committed
      $ git status
      # On branch pending-rebases
      # Untracked files:
      #   (use "git add <file>..." to include in what will be committed)
      #
      #	kernel/x509_certificate_list
      nothing added to commit but untracked files present (use "git add" to track)
      $
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: David Howells <dhowells@redhat.com>
      7cfe5b33
  10. 08 Dec, 2013 1 commit
  11. 07 Dec, 2013 1 commit
    • T
      cgroup: fix cgroup_create() error handling path · 266ccd50
      Tejun Heo committed
      ae7f164a ("cgroup: move cgroup->subsys[] assignment to
      online_css()") moved cgroup->subsys[] assignments later in
      cgroup_create() but didn't update error handling path accordingly
      leading to the following oops and leaking later css's after an
      online_css() failure.  The oops is from cgroup destruction path being
      invoked on the partially constructed cgroup which is not ready to
      handle empty slots in cgrp->subsys[] array.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        PGD a780a067 PUD aadbe067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
        Hardware name:
        task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
        RIP: 0010:[<ffffffff810eeaa8>]  [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        RSP: 0018:ffff8800a781bd98  EFLAGS: 00010282
        RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
        RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
        RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
        R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
        R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
        FS:  00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
        Stack:
         ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
         ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
         ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
        Call Trace:
         [<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
         [<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
         [<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
         [<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
         [<ffffffff8169e569>] system_call_fastpath+0x16/0x1b
      
      This patch moves reference bumping inside online_css() loop, clears
      css_ar[] as css's are brought online successfully, and updates
      err_destroy path so that either a css is fully online and destroyed by
      cgroup_destroy_locked() or the error path frees it.  This creates a
      duplicate css free logic in the error path but it will be cleaned up
      soon.
      
      v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
          invoked with a cgroup which doesn't have all css's populated.
          Update cgroup_destroy_locked() so that it skips NULL css's.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Reported-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: stable@vger.kernel.org # v3.12+
      266ccd50
  12. 06 Dec, 2013 1 commit
  13. 29 Nov, 2013 2 commits
  14. 28 Nov, 2013 2 commits
    • T
      cgroup: fix cgroup_subsys_state leak for seq_files · e605b365
      Tejun Heo committed
      If a cgroup file implements either read_map() or read_seq_string(),
      such file is served using seq_file by overriding file->f_op to
      cgroup_seqfile_operations, which also overrides the release method to
      single_release() from cgroup_file_release().
      
      Because cgroup_file_open() didn't use to acquire any resources, this
      used to be fine, but since f7d58818 ("cgroup: pin
      cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
      pins the css (cgroup_subsys_state) which is put by
      cgroup_file_release().  The patch forgot to update the release path
      for seq_files and each open/release cycle leaks a css reference.
      
      Fix it by updating cgroup_file_release() to also handle seq_files and
      using it for seq_file release path too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.12
      e605b365
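      The reference leak described in this commit message can be modelled
      with a toy counter. This is a minimal sketch, not the cgroup code:
      struct css_sketch and every function below are hypothetical stand-ins
      for the open and release paths named in the message.

```c
#include <assert.h>

/* Hypothetical stand-in for a cgroup_subsys_state reference count. */
struct css_sketch {
	int refcnt;
};

static void css_get(struct css_sketch *css)
{
	css->refcnt++;
}

static void css_put(struct css_sketch *css)
{
	assert(css->refcnt > 0); /* a put without a matching get is a bug */
	css->refcnt--;
}

/* Models the open path: every open pins the css. */
static void file_open_sketch(struct css_sketch *css)
{
	css_get(css);
}

/* Models the regular release path: drops the pin taken at open. */
static void file_release_sketch(struct css_sketch *css)
{
	css_put(css);
}

/* Models the overridden seq_file release: it never drops the pin. */
static void seq_release_buggy(struct css_sketch *css)
{
	(void)css;
}

/* Models the fix: the seq_file path goes through the common release. */
static void seq_release_fixed(struct css_sketch *css)
{
	file_release_sketch(css);
}
```

      With the buggy release, each open/release cycle leaves the count one
      higher; with the fixed release, the count returns to its starting
      value.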
    • P
      cpuset: Fix memory allocator deadlock · 0fc0287c
      Peter Zijlstra committed
      Juri hit the below lockdep report:
      
      [    4.303391] ======================================================
      [    4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
      [    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
      [    4.303395] ------------------------------------------------------
      [    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      [    4.303399]  (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290
      [    4.303417]
      [    4.303417] and this task is already holding:
      [    4.303418]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100
      [    4.303431] which would create a new lock dependency:
      [    4.303432]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
      [    4.303436]
      
      [    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
      [    4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
      [    4.303922]    HARDIRQ-ON-W at:
      [    4.303923]                     [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0
      [    4.303926]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303929]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303931]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303933]    SOFTIRQ-ON-W at:
      [    4.303933]                     [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0
      [    4.303935]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303940]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303955]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303959]    INITIAL USE at:
      [    4.303960]                    [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0
      [    4.303963]                    [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303966]                    [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303969]                    [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303972]  }
      
      Which reports that we take mems_allowed_seq with interrupts enabled. A
      little digging found that this can only be from
      cpuset_change_task_nodemask().
      
      This is an actual deadlock because an interrupt doing an allocation will
      hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
      forever waiting for the write side to complete.
      
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reported-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Juri Lelli <juri.lelli@gmail.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      0fc0287c
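      The deadlock described in this commit message follows from how a
      sequence counter works: a reader retries for as long as the sequence
      is odd. The single-threaded toy below is a sketch of that failure
      mode only. All names are hypothetical, there are no memory barriers,
      and the spin is bounded so that the example terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a seqcount: odd means "write in progress". */
struct seqcount_sketch {
	unsigned int sequence;
};

static void write_begin_sketch(struct seqcount_sketch *s)
{
	s->sequence++; /* becomes odd */
}

static void write_end_sketch(struct seqcount_sketch *s)
{
	s->sequence++; /* becomes even again */
}

/*
 * Reader side with a bounded spin. A real reader would loop without a
 * limit; returning false here stands for "would spin forever".
 */
static bool read_begin_bounded(const struct seqcount_sketch *s,
			       unsigned int max_spins, unsigned int *seq_out)
{
	assert(max_spins > 0);
	for (unsigned int i = 0; i < max_spins; i++) {
		unsigned int seq = s->sequence;

		if (!(seq & 1)) {
			*seq_out = seq;
			return true;
		}
	}
	return false;
}
```

      If the reader runs in an interrupt that arrived on the same CPU in
      the middle of the write section, the writer cannot make progress
      until the interrupt returns, so the sequence stays odd and the
      unbounded version of this loop never exits.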
  15. 27 Nov, 2013 3 commits
  16. 26 Nov, 2013 3 commits
    • S
      ftrace: Fix function graph with loading of modules · 8a56d776
      Steven Rostedt (Red Hat) committed
      Commit 8c4f3c3f "ftrace: Check module functions being traced on reload"
      fixed module loading and unloading with respect to function tracing, but
      it missed the function graph tracer. If you perform the following
      
       # cd /sys/kernel/debug/tracing
       # echo function_graph > current_tracer
       # modprobe nfsd
       # echo nop > current_tracer
      
      You'll get the following oops message:
      
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9()
       Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt
       CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
        0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000
        0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668
        ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000
       Call Trace:
        [<ffffffff814fe193>] dump_stack+0x4f/0x7c
        [<ffffffff8103b80a>] warn_slowpath_common+0x81/0x9b
        [<ffffffff810b2b9a>] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9
        [<ffffffff8103b83e>] warn_slowpath_null+0x1a/0x1c
        [<ffffffff810b2b9a>] __ftrace_hash_rec_update.part.35+0x168/0x1b9
        [<ffffffff81502f89>] ? __mutex_lock_slowpath+0x364/0x364
        [<ffffffff810b2cc2>] ftrace_shutdown+0xd7/0x12b
        [<ffffffff810b47f0>] unregister_ftrace_graph+0x49/0x78
        [<ffffffff810c4b30>] graph_trace_reset+0xe/0x10
        [<ffffffff810bf393>] tracing_set_tracer+0xa7/0x26a
        [<ffffffff810bf5e1>] tracing_set_trace_write+0x8b/0xbd
        [<ffffffff810c501c>] ? ftrace_return_to_handler+0xb2/0xde
        [<ffffffff811240a8>] ? __sb_end_write+0x5e/0x5e
        [<ffffffff81122aed>] vfs_write+0xab/0xf6
        [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
        [<ffffffff81122dbd>] SyS_write+0x59/0x82
        [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
        [<ffffffff8150a2d2>] system_call_fastpath+0x16/0x1b
       ---[ end trace 940358030751eafb ]---
      
      The above mentioned commit didn't go far enough. Well, it covered the
      function tracer by adding checks in __register_ftrace_function(). The
      problem is that the function graph tracer circumvents that (for a slight
      efficiency gain when function graph trace is running with a function
      tracer. The gain was not worth this).
      
      The problem came with ftrace_startup() which should always be called after
      __register_ftrace_function(), if you want this bug to be completely fixed.
      
      Anyway, this solution moves __register_ftrace_function() inside of
      ftrace_startup() and removes the need to call them both.
      Reported-by: Dave Wysochanski <dwysocha@redhat.com>
      Fixes: ed926f9b ("ftrace: Use counters to enable functions to trace")
      Cc: stable@vger.kernel.org # 3.0+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      8a56d776
    • B
      Revert "workqueue: allow work_on_cpu() to be called recursively" · 12997d1a
      Bjorn Helgaas committed
      This reverts commit c2fda509.
      
      c2fda509 removed lockdep annotation from work_on_cpu() to work around
      the PCI path that calls work_on_cpu() from within a work_on_cpu() work item
      (PF driver .probe() method -> pci_enable_sriov() -> add VFs -> VF driver
      .probe method).
      
      961da7fb6b22 ("PCI: Avoid unnecessary CPU switch when calling driver
      .probe() method") avoids that recursive work_on_cpu() use in a different
      way, so this revert restores the work_on_cpu() lockdep annotation.
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      12997d1a
    • L
      irq: Enable all irqs unconditionally in irq_resume · ac01810c
      Laxman Dewangan committed
      When the system enters suspend, it disables all interrupts in
      suspend_device_irqs(), including the interrupts marked EARLY_RESUME.
      
      On the resume side things are different. The EARLY_RESUME interrupts
      are reenabled in sys_core_ops->resume and the non EARLY_RESUME
      interrupts are reenabled in the normal system resume path.
      
      When suspend_noirq() fails or suspend is aborted for any other
      reason, we might omit the resume side call to sys_core_ops->resume()
      and therefore the interrupts marked EARLY_RESUME are not reenabled and
      stay disabled forever.
      
      To solve this, enable all irqs unconditionally in irq_resume(),
      regardless of whether interrupts marked EARLY_RESUME have already been
      enabled or not.
      
      This might try to reenable already enabled interrupts in the non
      failure case, but the only affected platform is XEN and it has been
      confirmed that it does not cause any side effects.
      
      [ tglx: Massaged changelog. ]
      Signed-off-by: Laxman Dewangan <ldewangan@nvidia.com>
      Acked-by-and-tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Heiko Stuebner <heiko@sntech.de>
      Reviewed-by: Pavel Machek <pavel@ucw.cz>
      Cc: <ian.campbell@citrix.com>
      Cc: <rjw@rjwysocki.net>
      Cc: <len.brown@intel.com>
      Cc: <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1385388587-16442-1-git-send-email-ldewangan@nvidia.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ac01810c
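      The suspend/resume asymmetry described in this commit message can be
      modelled with a small table of interrupts. This is a toy sketch under
      stated assumptions, not the genirq code: struct irq_sketch,
      NR_IRQS_SKETCH and all functions below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_IRQS_SKETCH 4 /* hypothetical table size */

struct irq_sketch {
	bool early_resume; /* stands in for the EARLY_RESUME marking */
	bool enabled;
};

/* Suspend disables every interrupt, EARLY_RESUME or not. */
static void suspend_all_sketch(struct irq_sketch *irqs)
{
	for (int i = 0; i < NR_IRQS_SKETCH; i++)
		irqs[i].enabled = false;
}

/* Normal resume path before the change: skips EARLY_RESUME interrupts. */
static void resume_old_sketch(struct irq_sketch *irqs)
{
	for (int i = 0; i < NR_IRQS_SKETCH; i++)
		if (!irqs[i].early_resume)
			irqs[i].enabled = true;
}

/* Resume path after the change: enables everything unconditionally. */
static void resume_new_sketch(struct irq_sketch *irqs)
{
	for (int i = 0; i < NR_IRQS_SKETCH; i++)
		irqs[i].enabled = true;
}

static int count_disabled_sketch(const struct irq_sketch *irqs)
{
	int n = 0;

	assert(irqs != NULL);
	for (int i = 0; i < NR_IRQS_SKETCH; i++)
		if (!irqs[i].enabled)
			n++;
	return n;
}
```

      In the aborted-suspend case the early resume step never runs, so the
      old path leaves the EARLY_RESUME entries disabled while the new path
      leaves none disabled.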
  17. 23 Nov, 2013 1 commit
    • L
      workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in init_workqueues · 4e8b22bd
      Li Bin committed
      When a work item starts execution, the high bits of the work's data
      contain the pool ID, which can represent a maximum of
      WORK_OFFQ_POOL_NONE. The pool ID is set to WORK_OFFQ_POOL_NONE when the
      work item is initialized, indicating that no pool is associated, and
      get_work_pool() uses it to check the associated pool. So if
      worker_pool_assign_id() assigns an ID greater than or equal to
      WORK_OFFQ_POOL_NONE to a pool, it triggers leakage, and it may break
      the non-reentrance guarantee.
      
      This patch fixes the issue by making worker_pool_assign_id() call
      idr_alloc() with the @end param set to WORK_OFFQ_POOL_NONE.
      
      Furthermore, in the current implementation, the BUILD_BUG_ON() in
      init_workqueues makes no sense. The number of worker pools needed
      cannot be determined at compile time, because the number of backing
      pools for UNBOUND workqueues is dynamic based on the assigned custom
      attributes. So remove it.
      
      tj: Minor comment and indentation updates.
      Signed-off-by: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4e8b22bd
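      The effect of passing an exclusive upper bound to the ID allocator,
      as this commit message describes, can be sketched with a toy
      allocator. This is not the kernel's idr: struct id_table_sketch and
      alloc_id_sketch() are hypothetical, and POOL_NONE_SKETCH is a small
      stand-in value for WORK_OFFQ_POOL_NONE.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TABLE_SIZE_SKETCH 64
#define POOL_NONE_SKETCH  8 /* hypothetical stand-in for WORK_OFFQ_POOL_NONE */

struct id_table_sketch {
	bool used[TABLE_SIZE_SKETCH];
};

/*
 * Return the lowest free ID in [start, end), or -ENOSPC if the range is
 * exhausted. The upper bound is exclusive, so `end` itself is never
 * handed out.
 */
static int alloc_id_sketch(struct id_table_sketch *t, int start, int end)
{
	assert(start >= 0 && end <= TABLE_SIZE_SKETCH);
	for (int id = start; id < end; id++) {
		if (!t->used[id]) {
			t->used[id] = true;
			return id;
		}
	}
	return -ENOSPC;
}
```

      Capping the range at the sentinel means the allocator fails once the
      usable IDs run out instead of returning a value that would be
      mistaken for "no pool".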