1. 19 April 2013 (4 commits)
    • mutex: Back out architecture specific check for negative mutex count · cc189d25
      Committed by Waiman Long
      Linus suggested that probably all the supported architectures can
      allow a negative mutex count without incorrect behavior, so we can
      then back out the architecture-specific change and allow the
      mutex count to go to any negative number. That should further
      reduce contention on non-x86 architectures.
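      A rough user-space model of the resulting convention, as a hedged
      sketch (not the kernel patch itself): 1 means unlocked, 0 means
      locked with no waiters recorded, and any negative value means
      locked with possible waiters, so one generic check replaces the
      per-architecture variants.

        #include <stdatomic.h>
        #include <stdbool.h>

        /* Model of the mutex count convention described above. */
        static bool count_shows_no_waiter(atomic_int *count)
        {
                /* Any negative value may mean queued waiters. */
                return atomic_load(count) >= 0;
        }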
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chandramouleeswaran Aswin <aswin@hp.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Norton Scott J <scott.norton@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1366226594-5506-5-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cc189d25
    • mutex: Queue mutex spinners with MCS lock to reduce cacheline contention · 2bd2c92c
      Committed by Waiman Long
      The current mutex spinning code (with the MUTEX_SPIN_ON_OWNER
      option turned on) allows multiple tasks to spin on a single mutex
      concurrently. A potential problem with the current approach is
      that when the mutex becomes available, all the spinning tasks
      will try to acquire the mutex more or less simultaneously. As a
      result, there will be a lot of cacheline bouncing, especially on
      systems with a large number of CPUs.
      
      This patch tries to reduce this kind of contention by putting
      the mutex spinners into a queue so that only the first one in
      the queue will try to acquire the mutex. This will reduce
      contention and allow all the tasks to move forward faster.
      
      The queuing of mutex spinners is done using an MCS-lock-based
      implementation, which reduces contention on the mutex cacheline
      further than a similar ticket-spinlock-based implementation
      would. This patch adds a new field to the mutex data structure
      for holding the MCS lock. This expands the mutex size by 8 bytes
      for 64-bit systems and 4 bytes for 32-bit systems. This overhead
      is avoided if the MUTEX_SPIN_ON_OWNER option is turned off.
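      As a hedged illustration of the MCS technique itself (a user-space
      C11 sketch with <stdatomic.h>, not the kernel's mspin_lock code):
      each spinner appends a node to a queue and busy-waits on its own
      node, so the lock handoff touches only one other CPU's cacheline.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stddef.h>

        /* One queue node per spinning thread. */
        struct mcs_node {
                struct mcs_node *_Atomic next;
                atomic_bool locked;
        };

        static void mcs_lock(struct mcs_node *_Atomic *tail, struct mcs_node *me)
        {
                struct mcs_node *prev;

                atomic_store(&me->next, NULL);
                atomic_store(&me->locked, false);

                prev = atomic_exchange(tail, me);   /* join the queue atomically */
                if (!prev)
                        return;                     /* queue was empty: lock is ours */

                atomic_store(&prev->next, me);      /* link in behind our predecessor */
                while (!atomic_load(&me->locked))
                        ;                           /* spin on our own node only */
        }

        static void mcs_unlock(struct mcs_node *_Atomic *tail, struct mcs_node *me)
        {
                struct mcs_node *next = atomic_load(&me->next);

                if (!next) {
                        struct mcs_node *expected = me;

                        /* No visible successor: try to mark the queue empty. */
                        if (atomic_compare_exchange_strong(tail, &expected, NULL))
                                return;
                        /* A successor is enqueueing; wait until it links in. */
                        while (!(next = atomic_load(&me->next)))
                                ;
                }
                atomic_store(&next->locked, true);  /* hand the lock to the next spinner */
        }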
      
      The following table shows the jobs per minute (JPM) scalability
      data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
      numactl command is used to restrict the running of the fserver
      workloads to 1/2/4/8 nodes with hyperthreading off.
      
      +-----------------+-----------+-----------+-------------+----------+
      |  Configuration  | Mean JPM  | Mean JPM  |  Mean JPM   | % Change |
      |                 | w/o patch | patch 1   | patches 1&2 |  1->1&2  |
      +-----------------+------------------------------------------------+
      |                 |              User Range 1100 - 2000            |
      +-----------------+------------------------------------------------+
      | 8 nodes, HT off |  227972   |  227237   |   305043    |  +34.2%  |
      | 4 nodes, HT off |  393503   |  381558   |   394650    |   +3.4%  |
      | 2 nodes, HT off |  334957   |  325240   |   338853    |   +4.2%  |
      | 1 node , HT off |  198141   |  197972   |   198075    |   +0.1%  |
      +-----------------+------------------------------------------------+
      |                 |              User Range 200 - 1000             |
      +-----------------+------------------------------------------------+
      | 8 nodes, HT off |  282325   |  312870   |   332185    |   +6.2%  |
      | 4 nodes, HT off |  390698   |  378279   |   393419    |   +4.0%  |
      | 2 nodes, HT off |  336986   |  326543   |   340260    |   +4.2%  |
      | 1 node , HT off |  197588   |  197622   |   197582    |    0.0%  |
      +-----------------+-----------+-----------+-------------+----------+
      
      At low user range 10-100, the JPM differences were within +/-1%.
      So they are not that interesting.
      
      The fserver workload uses mutex spinning extensively. With just
      the mutex change in the first patch, there is no noticeable gain
      in performance; rather, there is a slight drop. This mutex
      spinning patch more than recovers the lost performance and shows
      a significant increase of +30% at high user load with the full
      8 nodes. Similar improvements were also seen with a 3.8 kernel.
      
      The table below shows the %time spent by different kernel
      functions as reported by perf when running the fserver workload
      at 1500 users with all 8 nodes.
      
      +-----------------------+-----------+---------+-------------+
      |        Function       |  % time   | % time  |   % time    |
      |                       | w/o patch | patch 1 | patches 1&2 |
      +-----------------------+-----------+---------+-------------+
      | __read_lock_failed    |  34.96%   | 34.91%  |   29.14%    |
      | __write_lock_failed   |  10.14%   | 10.68%  |    7.51%    |
      | mutex_spin_on_owner   |   3.62%   |  3.42%  |    2.33%    |
      | mspin_lock            |    N/A    |   N/A   |    9.90%    |
      | __mutex_lock_slowpath |   1.46%   |  0.81%  |    0.14%    |
      | _raw_spin_lock        |   2.25%   |  2.50%  |    1.10%    |
      +-----------------------+-----------+---------+-------------+
      
      The fserver workload for an 8-node system is dominated by
      contention on the read/write lock. Mutex contention also plays a
      role. With the first patch only, mutex contention is down (as
      shown by the __mutex_lock_slowpath figure), which helps a little
      bit; we saw only a few percent improvement with that.
      
      By applying patch 2 as well, the single mutex_spin_on_owner
      figure is now split out into an additional mspin_lock figure.
      The combined time increases from 3.42% to 12.23%. It shows a
      great reduction in contention among the spinners, leading to a
      30% improvement. The time ratio 9.9/2.33 = 4.3 indicates that
      there are on average 4+ spinners waiting in the mspin_lock loop
      for each spinner in the mutex_spin_on_owner loop. Contention in
      other locking functions also goes down by quite a lot.
      
      The table below shows the performance change of both patches 1 &
      2 over patch 1 alone in other AIM7 workloads (at 8 nodes,
      hyperthreading off).
      
      +--------------+---------------+----------------+-----------------+
      |   Workload   | mean % change | mean % change  | mean % change   |
      |              | 10-100 users  | 200-1000 users | 1100-2000 users |
      +--------------+---------------+----------------+-----------------+
      | alltests     |      0.0%     |     -0.8%      |     +0.6%       |
      | five_sec     |     -0.3%     |     +0.8%      |     +0.8%       |
      | high_systime |     +0.4%     |     +2.4%      |     +2.1%       |
      | new_fserver  |     +0.1%     |    +14.1%      |    +34.2%       |
      | shared       |     -0.5%     |     -0.3%      |     -0.4%       |
      | short        |     -1.7%     |     -9.8%      |     -8.3%       |
      +--------------+---------------+----------------+-----------------+
      
      The short workload is the only one that shows a decline in
      performance, probably due to the spinner locking and queuing
      overhead.
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chandramouleeswaran Aswin <aswin@hp.com>
      Cc: Norton Scott J <scott.norton@hp.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1366226594-5506-4-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2bd2c92c
    • mutex: Make more scalable by doing less atomic operations · 0dc8c730
      Committed by Waiman Long
      In the __mutex_lock_common() function, an initial entry into
      the lock slow path will cause two atomic_xchg instructions to be
      issued. Together with the atomic decrement in the fast path, a
      total of three atomic read-modify-write instructions will be
      issued in rapid succession. This can cause a lot of cache
      bouncing when many tasks are trying to acquire the mutex at the
      same time.
      
      This patch reduces the number of atomic_xchg instructions used
      by checking the counter value first before issuing the
      instruction. The atomic_read() function is just a simple memory
      read. The atomic_xchg() function, on the other hand, can be up
      to two orders of magnitude or more costly than atomic_read().
      By using atomic_read() to check the value first before calling
      atomic_xchg(), we can avoid a lot of unnecessary cache coherency
      traffic. The only downside of this change is that a task on the
      slow path will have a slightly lower chance of getting the mutex
      when competing with another task in the fast path.
      
      The same is true for the atomic_cmpxchg() function in the
      mutex-spin-on-owner loop. So an atomic_read() is also performed
      before calling atomic_cmpxchg().
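      A minimal user-space sketch of the "read before read-modify-write"
      idea (illustrative only; the count convention follows the mutex
      code described above, with 1 = unlocked):

        #include <stdatomic.h>
        #include <stdbool.h>

        /* A plain atomic load can be satisfied from a shared cacheline,
         * while an exchange always demands exclusive ownership.  Only
         * attempt the expensive RMW when the cheap read says it can
         * still succeed. */
        static bool try_acquire(atomic_int *count)
        {
                if (atomic_load(count) != 1)
                        return false;                   /* cheap check */
                return atomic_exchange(count, 0) == 1;  /* RMW only now */
        }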
      
      The mutex locking and unlocking code for the x86 architecture
      can allow any negative number to be used in the mutex count to
      indicate that some tasks are waiting for the mutex. I am not so
      sure if that is the case for the other architectures. So the
      default is to avoid atomic_xchg() if the count has already been
      set to -1. For x86, the check is modified to include all
      negative numbers to cover the more general case.
      
      The following table shows the jobs per minute (JPM) scalability
      data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
      numactl command is used to restrict the running of the
      high_systime workloads to 1/2/4/8 nodes with hyperthreading on
      and off.
      
      +-----------------+-----------+------------+----------+
      |  Configuration  | Mean JPM  |  Mean JPM  | % Change |
      |                 | w/o patch | with patch |          |
      +-----------------+-----------------------------------+
      |                 |      User Range 1100 - 2000       |
      +-----------------+-----------------------------------+
      | 8 nodes, HT on  |   36980   |   148590   | +301.8%  |
      | 8 nodes, HT off |   42799   |   145011   | +238.8%  |
      | 4 nodes, HT on  |   61318   |   118445   |  +51.1%  |
      | 4 nodes, HT off |  158481   |   158592   |   +0.1%  |
      | 2 nodes, HT on  |  180602   |   173967   |   -3.7%  |
      | 2 nodes, HT off |  198409   |   198073   |   -0.2%  |
      | 1 node , HT on  |  149042   |   147671   |   -0.9%  |
      | 1 node , HT off |  126036   |   126533   |   +0.4%  |
      +-----------------+-----------------------------------+
      |                 |       User Range 200 - 1000       |
      +-----------------+-----------------------------------+
      | 8 nodes, HT on  |   41525   |   122349   | +194.6%  |
      | 8 nodes, HT off |   49866   |   124032   | +148.7%  |
      | 4 nodes, HT on  |   66409   |   106984   |  +61.1%  |
      | 4 nodes, HT off |  119880   |   130508   |   +8.9%  |
      | 2 nodes, HT on  |  138003   |   133948   |   -2.9%  |
      | 2 nodes, HT off |  132792   |   131997   |   -0.6%  |
      | 1 node , HT on  |  116593   |   115859   |   -0.6%  |
      | 1 node , HT off |  104499   |   104597   |   +0.1%  |
      +-----------------+-----------+------------+----------+
      
      At low user range 10-100, the JPM differences were within +/-1%.
      So they are not that interesting.
      
      The AIM7 benchmark has a pretty large run-to-run variance due to
      the random nature of the subtests executed, so a difference of
      less than +/-5% may not be really significant.
      
      This patch improves high_systime workload performance at 4 nodes
      and up by maintaining transaction rates without a significant
      drop-off at high node counts.  The patch has practically no
      impact on 1- and 2-node systems.
      
      The table below shows the percentage time (as reported by perf
      record -a -s -g) spent on the __mutex_lock_slowpath() function
      by the high_systime workload at 1500 users for 2/4/8-node
      configurations with hyperthreading off.
      
      +---------------+-----------------+------------------+---------+
      | Configuration | %Time w/o patch | %Time with patch | %Change |
      +---------------+-----------------+------------------+---------+
      |    8 nodes    |      65.34%     |      0.69%       |  -99%   |
      |    4 nodes    |       8.70%     |      1.02%       |  -88%   |
      |    2 nodes    |       0.41%     |      0.32%       |  -22%   |
      +---------------+-----------------+------------------+---------+
      
      It is obvious that the dramatic performance improvement at 8
      nodes was due to the drastic cut in the time spent within the
      __mutex_lock_slowpath() function.
      
      The table below shows the improvements in other AIM7 workloads
      (at 8 nodes, hyperthreading off).
      
      +--------------+---------------+----------------+-----------------+
      |   Workload   | mean % change | mean % change  | mean % change   |
      |              | 10-100 users  | 200-1000 users | 1100-2000 users |
      +--------------+---------------+----------------+-----------------+
      | alltests     |     +0.6%     |   +104.2%      |   +185.9%       |
      | five_sec     |     +1.9%     |     +0.9%      |     +0.9%       |
      | fserver      |     +1.4%     |     -7.7%      |     +5.1%       |
      | new_fserver  |     -0.5%     |     +3.2%      |     +3.1%       |
      | shared       |    +13.1%     |   +146.1%      |   +181.5%       |
      | short        |     +7.4%     |     +5.0%      |     +4.2%       |
      +--------------+---------------+----------------+-----------------+
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chandramouleeswaran Aswin <aswin@hp.com>
      Cc: Norton Scott J <scott.norton@hp.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1366226594-5506-3-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0dc8c730
    • mutex: Move mutex spinning code from sched/core.c back to mutex.c · 41fcb9f2
      Committed by Waiman Long
      As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
      feature bit was really just an early hack to make with/without
      mutex-spinning testable. So it is no longer necessary.
      
      This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
      moves the mutex spinning code from kernel/sched/core.c back to
      kernel/mutex.c, which is where it belongs.
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chandramouleeswaran Aswin <aswin@hp.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Norton Scott J <scott.norton@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      41fcb9f2
  2. 10 April 2013 (1 commit)
  3. 08 April 2013 (1 commit)
  4. 01 April 2013 (1 commit)
    • Revert "lockdep: check that no locks held at freeze time" · dbf520a9
      Committed by Paul Walmsley
      This reverts commit 6aa97070.
      
      Commit 6aa97070 ("lockdep: check that no locks held at freeze time")
      causes problems with NFS root filesystems.  The failures were noticed on
      OMAP2 and 3 boards during kernel init:
      
        [ BUG: swapper/0/1 still has locks held! ]
        3.9.0-rc3-00344-ga937536b #1 Not tainted
        -------------------------------------
        1 lock held by swapper/0/1:
         #0:  (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574
      
        stack backtrace:
          rpc_wait_bit_killable
          __wait_on_bit
          out_of_line_wait_on_bit
          __rpc_execute
          rpc_run_task
          rpc_call_sync
          nfs_proc_get_root
          nfs_get_root
          nfs_fs_mount_common
          nfs_try_mount
          nfs_fs_mount
          mount_fs
          vfs_kern_mount
          do_mount
          sys_mount
          do_mount_root
          mount_root
          prepare_namespace
          kernel_init_freeable
          kernel_init
      
      Although the rootfs mounts, the system is unstable.  Here's a transcript
      from a PM test:
      
        http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt
      
      Here's what the test log should look like:
      
        http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt
      
      Mailing list discussion is here:
      
        http://lkml.org/lkml/2013/3/4/221
      
      Deal with this for v3.9 by reverting the problem commit, until folks can
      figure out the right long-term course of action.
      Signed-off-by: Paul Walmsley <paul@pwsan.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Shawn Guo <shawn.guo@linaro.org>
      Cc: <maciej.rutecki@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ben Chan <benchan@chromium.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbf520a9
  5. 27 March 2013 (2 commits)
    • userns: Restrict when proc and sysfs can be mounted · 87a8ebd6
      Committed by Eric W. Biederman
      Only allow unprivileged mounts of proc and sysfs if they are already
      mounted when the user namespace is created.
      
      proc and sysfs are interesting because they have content that is
      per namespace, and so fresh mounts are needed when new namespaces
      are created while at the same time proc and sysfs have content that
      is shared between every instance.
      
      Respect the policy of who may see the shared content of proc and sysfs
      by only allowing new mounts if there was an existing mount at the time
      the user namespace was created.
      
      In practice there are only two interesting cases: proc and sysfs
      are mounted at their usual places, or proc and sysfs are not
      mounted at all (some form of mount namespace jail).
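      A rough sketch of the shape of this policy check (hypothetical
      helper name; not the actual proc/sysfs code):

        /* Hypothetical sketch only, called from the filesystem's mount
         * path when the mount is requested from a user namespace. */
        static int userns_new_mount_allowed(struct file_system_type *type)
        {
                /* Mounts in the initial user namespace are unaffected. */
                if (current_user_ns() == &init_user_ns)
                        return 0;

                /* Otherwise only allow a fresh mount if an instance of
                 * this filesystem was already mounted, and thus already
                 * visible, when the user namespace was created. */
                if (!fs_was_visible_at_userns_creation(type))  /* hypothetical */
                        return -EPERM;

                return 0;
        }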
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      87a8ebd6
    • userns: Don't allow creation if the user is chrooted · 3151527e
      Committed by Eric W. Biederman
      Guarantee that the policy of which files may be accessed that is
      established by setting the root directory will not be violated
      by user namespaces, by verifying that the root directory points
      to the root of the mount namespace at the time of user namespace
      creation.
      
      Changing the root is a privileged operation, and as a matter of policy
      it serves to limit unprivileged processes to files below the current
      root directory.
      
      For reasons of simplicity and comprehensibility, the privilege to
      change the root directory is gated solely on the CAP_SYS_CHROOT
      capability in the user namespace.  Therefore, when creating a user
      namespace we must ensure that the policy of which files may be
      accessed cannot be violated by changing the root directory.

      Anyone who runs a process in a chroot and would like to use user
      namespaces can set up the same view of filesystems with a mount
      namespace instead.  As a result, this is not a practical
      limitation for using user namespaces.
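      A hedged sketch of the check at user-namespace creation time (the
      helper name is illustrative, not necessarily the real one):

        /* Refuse to create a user namespace for a chrooted task, i.e.
         * one whose root directory is not the root of its mount
         * namespace; illustrative helper name. */
        if (task_is_chrooted(current))
                return -EPERM;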
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      3151527e
  6. 26 March 2013 (1 commit)
  7. 23 March 2013 (2 commits)
    • poweroff: change orderly_poweroff() to use schedule_work() · 2ca067ef
      Committed by Oleg Nesterov
      David said:
      
          Commit 6c0c0d4d ("poweroff: fix bug in orderly_poweroff()")
          apparently fixes one bug in orderly_poweroff(), but introduces
          another.  The comments on orderly_poweroff() claim it can be called
          from any context - and indeed we call it from interrupt context in
          arch/powerpc/platforms/pseries/ras.c for example.  But since that
          commit this is no longer safe, since call_usermodehelper_fns() is not
          safe in interrupt context without the UMH_NO_WAIT option.
      
      orderly_poweroff() can be used from any context but UMH_WAIT_EXEC is
      sleepable.  Move the "force" logic into __orderly_poweroff() and change
      orderly_poweroff() to use the global poweroff_work which simply calls
      __orderly_poweroff().
      
      While at it, remove the unneeded "int argc" and change argv_split() to
      use GFP_KERNEL.
      
      We use the global "bool poweroff_force" to pass the argument;
      this can obviously affect the previous request if it is
      pending/running.  So we only allow the "false => true"
      transition, assuming that the pending "true" should succeed
      anyway.  If schedule_work() fails after that, we know that
      work->func() was not called yet, so it must see the new value.

      This means that orderly_poweroff() becomes async even if we do
      not run the command, and it always succeeds; schedule_work() can
      only fail if the work is already pending.  We can export
      __orderly_poweroff() and change the non-atomic callers which
      want the old semantics.
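      A simplified sketch of the resulting pattern (consistent with the
      description above, not the literal hunk; the work-function name is
      illustrative): the entry point is safe in atomic context and only
      records the request, while the sleepable usermode-helper part runs
      from a workqueue.

        static bool poweroff_force;

        static void poweroff_work_func(struct work_struct *work)
        {
                /* Process context: may sleep, run the usermode helper,
                 * and fall back to a forced poweroff if that fails. */
                __orderly_poweroff(poweroff_force);
        }

        static DECLARE_WORK(poweroff_work, poweroff_work_func);

        void orderly_poweroff(bool force)
        {
                /* Callable from any context, including interrupts.
                 * Only the false -> true transition is allowed. */
                if (force)
                        poweroff_force = true;
                schedule_work(&poweroff_work);
        }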
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reported-by: David Gibson <david@gibson.dropbear.id.au>
      Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
      Cc: Feng Hong <hongfeng@marvell.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ca067ef
    • printk: Provide a wake_up_klogd() off-case · dc72c32e
      Committed by Frederic Weisbecker
      wake_up_klogd() is useless when CONFIG_PRINTK=n because neither
      printk() nor printk_sched() is in use and there are actually no
      waiters on the log_wait waitqueue.  It should be a stub in this
      case for users like bust_spinlocks().
      
      Otherwise this results in this warning when CONFIG_PRINTK=n and
      CONFIG_IRQ_WORK=n:
      
      	kernel/built-in.o In function `wake_up_klogd':
      	(.text.wake_up_klogd+0xb4): undefined reference to `irq_work_queue'
      
      To fix this, provide an off-case for wake_up_klogd() when
      CONFIG_PRINTK=n.
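      The usual shape of such an off-case, as a simplified sketch:

        /* include/linux/printk.h, simplified */
        #ifdef CONFIG_PRINTK
        extern void wake_up_klogd(void);
        #else
        static inline void wake_up_klogd(void)
        {
        }
        #endif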
      
      There is much more from console_unlock() and other console-related
      code in printk.c that should be moved under CONFIG_PRINTK.  But
      for now, focus on a minimal fix as we are past the merge window
      already.
      
      [akpm@linux-foundation.org: include printk.h in bust_spinlocks.c]
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reported-by: James Hogan <james.hogan@imgtec.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc72c32e
  8. 18 March 2013 (2 commits)
    • perf: Generate EXIT event only once per task context · d610d98b
      Committed by Namhyung Kim
      perf_event_task_event() iterates the pmu list and generates
      events for each eligible pmu context.  But if the task_event has
      a task_ctx, as in EXIT, it will generate events even though the
      pmu doesn't have an eligible one. Fix it by moving the code to
      the proper places.
      
      Before this patch:
      
        $ perf record -n true
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.006 MB perf.data (~248 samples) ]
      
        $ perf report -D | tail
        Aggregated stats:
                   TOTAL events:         73
                    MMAP events:         67
                    COMM events:          2
                    EXIT events:          4
        cycles stats:
                   TOTAL events:         73
                    MMAP events:         67
                    COMM events:          2
                    EXIT events:          4
      
      After this patch:
      
        $ perf report -D | tail
        Aggregated stats:
                   TOTAL events:         70
                    MMAP events:         67
                    COMM events:          2
                    EXIT events:          1
        cycles stats:
                   TOTAL events:         70
                    MMAP events:         67
                    COMM events:          2
                    EXIT events:          1
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1363332433-7637-1-git-send-email-namhyung@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d610d98b
    • perf: Reset hwc->last_period on sw clock events · 778141e3
      Committed by Namhyung Kim
      When cpu/task clock events are initialized, their sampling
      frequencies are converted to have a fixed value.  However, this
      missed updating hwc->last_period, which was set to 1 for the
      initial sampling frequency calibration.

      Because this hwc->last_period value is used as the period in
      perf_swevent_hrtimer(), every recorded sample will have an
      incorrect period of 1.
      
        $ perf record -e task-clock noploop 1
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.158 MB perf.data (~6919 samples) ]
      
        $ perf report -n --show-total-period  --stdio
        # Samples: 4K of event 'task-clock'
        # Event count (approx.): 4000
        #
        # Overhead       Samples        Period  Command  Shared Object              Symbol
        # ........  ............  ............  .......  .............  ..................
        #
            99.95%          3998          3998  noploop  noploop        [.] main
             0.03%             1             1  noploop  libc-2.15.so   [.] init_cacheinfo
             0.03%             1             1  noploop  ld-2.15.so     [.] open_verify
      
      Note that it doesn't affect non-sampling events, so perf stat
      still gets correct values with or without this patch.
      
        $ perf stat -e task-clock noploop 1
      
         Performance counter stats for 'noploop 1':
      
               1000.272525 task-clock                #    1.000 CPUs utilized
      
               1.000560605 seconds time elapsed
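      A hedged sketch of the kind of fix this implies (a simplified
      fragment; it assumes the usual perf hw_perf_event fields, not the
      exact hunk):

        /* sw clock event init, simplified: when a sampling frequency is
         * converted to a fixed period, refresh last_period as well so
         * the first samples do not report a period of 1. */
        if (event->attr.freq) {
                long freq = event->attr.sample_freq;

                event->attr.sample_period = NSEC_PER_SEC / freq;
                hwc->sample_period = event->attr.sample_period;
                hwc->last_period = hwc->sample_period;   /* the missing update */
                local64_set(&hwc->period_left, hwc->sample_period);
                event->attr.freq = 0;
        }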
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1363574507-18808-1-git-send-email-namhyung@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      778141e3
  9. 15 March 2013 (3 commits)
  10. 14 March 2013 (5 commits)
    • workqueue: convert to idr_alloc() · e68035fb
      Committed by Tejun Heo
      idr_get_new*() and friends are about to be deprecated.  Convert to the
      new idr_alloc() interface.
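      The general shape of the conversion (illustrative, not the exact
      workqueue hunk):

        /* Old two-step interface:
         *     if (!idr_pre_get(&idr, GFP_KERNEL))
         *             return -ENOMEM;
         *     ret = idr_get_new(&idr, ptr, &id);
         *
         * New interface: allocate an ID >= 0 in one call
         * (end == 0 means no upper limit). */
        id = idr_alloc(&idr, ptr, 0, 0, GFP_KERNEL);
        if (id < 0)
                return id;      /* -ENOMEM or -ENOSPC */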
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e68035fb
    • kernel/signal.c: use __ARCH_HAS_SA_RESTORER instead of SA_RESTORER · 522cff14
      Committed by Andrew Morton
      __ARCH_HAS_SA_RESTORER is the preferred conditional for use in 3.9 and
      later kernels, per Kees.
      
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      522cff14
    • signal: always clear sa_restorer on execve · 2ca39528
      Committed by Kees Cook
      When the new signal handlers are set up, the location of sa_restorer is
      not cleared, leaking a parent process's address space location to
      children.  This allows for a potential bypass of the parent's ASLR by
      examining the sa_restorer value returned when calling sigaction().
      
      Based on what should be considered "secret" about addresses, it only
      matters across the exec not the fork (since the VMAs haven't changed
      until the exec).  But since exec sets SIG_DFL and keeps sa_restorer,
      this is where it should be fixed.
      
      Given the few uses of sa_restorer, a "set" function was not written
      since this would be the only use.  Instead, we use
      __ARCH_HAS_SA_RESTORER, as already done in other places.
      
      Example of the leak before applying this patch:
      
        $ cat /proc/$$/maps
        ...
        7fb9f3083000-7fb9f3238000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
        ...
        $ ./leak
        ...
        7f278bc74000-7f278be29000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
        ...
        1 0 (nil) 0x7fb9f30b94a0
        2 4000000 (nil) 0x7f278bcaa4a0
        3 4000000 (nil) 0x7f278bcaa4a0
        4 0 (nil) 0x7fb9f30b94a0
        ...
      
      [akpm@linux-foundation.org: use SA_RESTORER for backportability]
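      A simplified sketch of the fix shape in flush_signal_handlers()
      (the stable backport keys off SA_RESTORER instead, per the note
      above):

        void flush_signal_handlers(struct task_struct *t, int force_default)
        {
                struct k_sigaction *ka = &t->sighand->action[0];
                int i;

                for (i = _NSIG; i != 0; i--) {
                        if (force_default || ka->sa.sa_handler != SIG_IGN)
                                ka->sa.sa_handler = SIG_DFL;
                        ka->sa.sa_flags = 0;
        #ifdef __ARCH_HAS_SA_RESTORER
                        /* Don't leak the parent's restorer address. */
                        ka->sa.sa_restorer = NULL;
        #endif
                        sigemptyset(&ka->sa.sa_mask);
                        ka++;
                }
        }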
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Emese Revfy <re.emese@gmail.com>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Julien Tinnes <jln@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ca39528
    • userns: Don't allow CLONE_NEWUSER | CLONE_FS · e66eded8
      Committed by Eric W. Biederman
      Don't allow sharing the root directory with processes in a
      different user namespace.  There doesn't seem to be any point,
      and to allow it would require the overhead of putting a user
      namespace reference in fs_struct (for permission checks) and
      incrementing that reference count on practically every call to
      fork.
      
      So just perform the inexpensive test of forbidding sharing an
      fs_struct across processes in different user namespaces.  We
      already disallow other forms of threading when unsharing a user
      namespace, so this should be no real burden in practice.
      
      This updates setns, clone, and unshare to disallow multiple user
      namespaces sharing an fs_struct.
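      A hedged sketch of the kind of check added on the clone() path
      (simplified; the unshare and setns paths get equivalent checks):

        /* Creating a new user namespace while continuing to share the
         * fs_struct would let the new namespace bypass the root-directory
         * policy, so refuse the combination outright. */
        if ((clone_flags & (CLONE_NEWUSER | CLONE_FS)) ==
                           (CLONE_NEWUSER | CLONE_FS))
                return ERR_PTR(-EINVAL);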
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e66eded8
    • tracing: Fix free of probe entry by calling call_rcu_sched() · 740466bc
      Committed by Steven Rostedt (Red Hat)
      Because function tracing is very invasive, and can even trace
      calls to rcu_read_lock(), RCU access in function tracing is done
      with preempt_disable_notrace(). This requires a synchronize_sched()
      for updates and not a synchronize_rcu().
      
      Function probes (traceon, traceoff, etc.) must be freed only
      after a synchronize_sched() once their entries have been removed
      from the hash. But call_rcu() was used. Fix this by using
      call_rcu_sched().
      
      Also fix the usage to use hlist_del_rcu() instead of hlist_del().
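      A simplified sketch of the resulting teardown pattern (struct and
      field names are illustrative, not the actual ftrace definitions):

        /* Free the probe entry only after a sched-RCU grace period,
         * matching the preempt_disable_notrace() readers. */
        static void free_probe_rcu(struct rcu_head *head)
        {
                struct func_probe *entry =
                        container_of(head, struct func_probe, rcu);

                kfree(entry);
        }

        static void remove_probe(struct func_probe *entry)
        {
                hlist_del_rcu(&entry->node);                 /* was hlist_del() */
                call_rcu_sched(&entry->rcu, free_probe_rcu); /* was call_rcu()  */
        }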
      
      Cc: stable@vger.kernel.org
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      740466bc
  11. 13 March 2013 (2 commits)
  12. 12 March 2013 (1 commit)
    • tracing: Fix race in snapshot swapping · 2721e72d
      Committed by Steven Rostedt (Red Hat)
      Although the swap is wrapped with a spin_lock, the assignment of
      the temp buffer used to swap is not within that lock. It needs
      to be moved into that lock; otherwise two swaps happening on two
      different CPUs can end up using the wrong temp buffer to assign
      in the swap.
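      The shape of the race and the fix, as a simplified sketch (the
      variable names are illustrative, not the exact tracing code):

        /* Buggy shape: the temp pointer is captured before the lock is
         * taken, so two CPUs swapping concurrently can both read the
         * same stale value. */
        tmp = tr->buffer;
        arch_spin_lock(&swap_lock);
        tr->buffer = max_tr->buffer;
        max_tr->buffer = tmp;
        arch_spin_unlock(&swap_lock);

        /* Fixed shape: capture the temp pointer inside the lock. */
        arch_spin_lock(&swap_lock);
        tmp = tr->buffer;
        tr->buffer = max_tr->buffer;
        max_tr->buffer = tmp;
        arch_spin_unlock(&swap_lock);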
      
      Luckily, all current callers of the swap function appear to have
      their own locks. But in case something is added that allows two
      different callers to call the swap, then there's a chance that
      this race can trigger and corrupt the buffers.
      
      New code is coming soon that will allow for this race to trigger.
      
      I've Cc'd stable, so this bug will not show up if someone backports
      one of the changes that can trigger this bug.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      2721e72d
  13. 09 March 2013 (2 commits)
    • workqueue: fix possible pool stall bug in wq_unbind_fn() · eb283428
      Committed by Lai Jiangshan
      Since multiple pools per cpu have been introduced, wq_unbind_fn() has
      a subtle bug which may theoretically stall work item processing.  The
      problem is two-fold.
      
      * wq_unbind_fn() depends on the worker executing wq_unbind_fn() itself
        to start unbound chain execution, which works fine when there was
        only single pool.  With multiple pools, only the pool which is
        running wq_unbind_fn() - the highpri one - is guaranteed to have
        such kick-off.  The other pool could stall when its busy workers
        block.
      
      * The current code sets WORKER_UNBIND / POOL_DISASSOCIATED of
        the two pools in succession without initiating work execution
        in between.  Because setting the flags requires grabbing assoc_mutex,
        which is held while new workers are created, this could lead to
        stalls if a pool's manager is waiting for the previous pool's work
        items to release memory.  This is almost purely theoretical, though.
      
      Update wq_unbind_fn() such that it sets WORKER_UNBIND /
      POOL_DISASSOCIATED, goes over schedule() and explicitly kicks off
      execution for a pool and then moves on to the next one.
      
      tj: Updated comments and description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      eb283428
    • Revert parts of "hlist: drop the node parameter from iterators" · dc893e19
      Committed by Arnd Bergmann
      Commit b67bfe0d ("hlist: drop the node parameter from iterators")
      did a lot of nice changes but also contains two small hunks that seem to
      have slipped in accidentally and have no apparent connection to the
      intent of the patch.
      
      This reverts the two extraneous changes.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Senna Tschudin <peter.senna@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc893e19
  14. 08 March 2013 (1 commit)
    • clockevents: Don't allow dummy broadcast timers · a7dc19b8
      Committed by Mark Rutland
      Currently tick_check_broadcast_device doesn't reject clock_event_devices
      with CLOCK_EVT_FEAT_DUMMY, and may select them in preference to real
      hardware if they have a higher rating value. In this situation, the
      dummy timer is responsible for broadcasting to itself, and the core
      clockevents code may attempt to call non-existent callbacks for
      programming the dummy, eventually leading to a panic.
      
      This patch makes tick_check_broadcast_device always reject dummy timers,
      preventing this problem.
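      A sketch of the kind of early-out this adds to the broadcast
      device selection (simplified, not the literal hunk):

        /* Never install a dummy timer as the broadcast device: it has
         * no real hardware behind it and cannot be programmed to wake
         * anyone up. */
        if (dev->features & CLOCK_EVT_FEAT_DUMMY)
                return 0;       /* reject, keep the current device */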
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Jon Medhurst (Tixy) <tixy@linaro.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a7dc19b8
  15. 07 March 2013 (2 commits)
    • tracing: Do not return EINVAL in snapshot when not allocated · c9960e48
      Committed by Steven Rostedt (Red Hat)
      To use the tracing snapshot feature, writing a '1' into the snapshot
      file causes the snapshot buffer to be allocated if it has not already
      been allocated and does a 'swap' with the main buffer, so that the
      snapshot now contains what was in the main buffer, and the main buffer
      now writes to what was the snapshot buffer.
      
      To free the snapshot buffer, a '0' is written into the snapshot file.
      
      To clear the snapshot buffer, any number but a '0' or '1' is written
      into the snapshot file. But if the buffer is not allocated, this
      returns the -EINVAL error code, which is rather pointless. It is
      better just to do nothing and return success.
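      A sketch of the behavioural change in the snapshot write handler
      (simplified; the condition and helper names are illustrative):

        /* Any value other than '0' or '1' requests a clear. */
        if (snapshot_allocated)          /* illustrative condition */
                clear_snapshot_buffer(); /* hypothetical helper    */
        /* else: nothing to clear; fall through and report success
         * instead of the old -EINVAL. */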
      Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      c9960e48
    • tracing: Add help of snapshot feature when snapshot is empty · d8741e2e
      Committed by Steven Rostedt (Red Hat)
      When cat'ing the snapshot file, instead of showing an empty trace
      header like the trace file does, show how to use the snapshot
      feature.
      
      Also, this is a good place to show if the snapshot has been allocated
      or not. Users may want to "pre allocate" the snapshot to have a fast
      "swap" of the current buffer. Otherwise, a swap would be slow and might
      fail as it would need to allocate the snapshot buffer, and that might
      fail under tight memory constraints.
      
      Here's what it looked like before:
      
       # tracer: nop
       #
       # entries-in-buffer/entries-written: 0/0   #P:4
       #
       #                              _-----=> irqs-off
       #                             / _----=> need-resched
       #                            | / _---=> hardirq/softirq
       #                            || / _--=> preempt-depth
       #                            ||| /     delay
       #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
       #              | |       |   ||||       |         |
      
      Here's what it looks like now:
      
       # tracer: nop
       #
       #
       # * Snapshot is freed *
       #
       # Snapshot commands:
       # echo 0 > snapshot : Clears and frees snapshot buffer
       # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
       #                      Takes a snapshot of the main buffer.
       # echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
       #                      (Doesn't have to be '2' works with any number that
       #                       is not a '0' or '1')
      Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      d8741e2e
  16. 03 March 2013 (2 commits)
    • fix compat_sys_rt_sigprocmask() · db61ec29
      Committed by Al Viro
      Converting bitmask to 32bit granularity is fine, but we'd better
      _do_ something with the result.  Such as "copy it to userland"...
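      A hedged sketch of the missing step (simplified, not the literal
      hunk; it assumes the usual sigset_to_compat()/copy_to_user()
      helpers): after converting the old mask to the compat layout,
      copy it back to the user buffer.

        if (oset) {
                compat_sigset_t old32;

                sigset_to_compat(&old32, &old_set);
                if (copy_to_user(oset, &old32, sizeof(old32)))
                        return -EFAULT;  /* previously the result was dropped */
        }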
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      db61ec29
    • trace/ring_buffer: handle 64bit aligned structs · 649508f6
      Committed by James Hogan
      Some 32 bit architectures require 64 bit values to be aligned (for
      example Meta which has 64 bit read/write instructions). These require 8
      byte alignment of event data too, so use
      !CONFIG_HAVE_64BIT_ALIGNED_ACCESS instead of !CONFIG_64BIT ||
      CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to decide alignment, and align
      buffer_data_page::data accordingly.
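      A hedged sketch of the kind of conditional this implies (macro
      name and layout simplified; not necessarily the exact hunk):

        /* Force 8-byte alignment of event data only where the
         * architecture really needs naturally aligned 64-bit accesses. */
        #ifdef CONFIG_HAVE_64BIT_ALIGNED_ACCESS
        # define RB_ARCH_ALIGNMENT      8U
        #else
        # define RB_ARCH_ALIGNMENT      4U
        #endif

        struct buffer_data_page {
                u64             time_stamp;    /* page time stamp        */
                local_t         commit;        /* write committed index  */
                unsigned char   data[] __aligned(RB_ARCH_ALIGNMENT);
        };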
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org> (previous version subtly different)
      649508f6
  17. 02 March 2013 (8 commits)