1. 28 May, 2011 8 commits
    • P
      rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state · cc3ce517
      Paul E. McKenney committed
      Upon creation, kthreads are in TASK_UNINTERRUPTIBLE state, which can
      result in softlockup warnings.  Because some of RCU's kthreads can
      legitimately be idle indefinitely, start them in TASK_INTERRUPTIBLE
      state in order to avoid those warnings.
      Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cc3ce517
    • P
      rcu: Remove waitqueue usage for cpu, node, and boost kthreads · 08bca60a
      Peter Zijlstra committed
      It is not necessary to use waitqueues for the RCU kthreads because
      we always know exactly which thread is to be awakened.  In addition,
      wake_up() only issues an actual wakeup when there is a thread waiting on
      the queue, which was why there was an extra explicit wake_up_process()
      to get the RCU kthreads started.
      
      Eliminating the waitqueues (and wake_up()) in favor of wake_up_process()
      eliminates the need for the initial wake_up_process() and also shrinks
      the data structure size a bit.  The wakeup logic is placed in a new
      rcu_wait() macro.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      08bca60a
    • P
      rcu: Avoid acquiring rcu_node locks in timer functions · 8826f3b0
      Paul E. McKenney committed
      This commit switches manipulations of the rcu_node ->wakemask field
      to atomic operations, which allows rcu_cpu_kthread_timer() to avoid
      acquiring the rcu_node lock.  This should avoid the following lockdep
      splat reported by Valdis Kletnieks:
      
      [   12.872150] usb 1-4: new high speed USB device number 3 using ehci_hcd
      [   12.986667] usb 1-4: New USB device found, idVendor=413c, idProduct=2513
      [   12.986679] usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0
      [   12.987691] hub 1-4:1.0: USB hub found
      [   12.987877] hub 1-4:1.0: 3 ports detected
      [   12.996372] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input10
      [   13.071471] udevadm used greatest stack depth: 3984 bytes left
      [   13.172129]
      [   13.172130] =======================================================
      [   13.172425] [ INFO: possible circular locking dependency detected ]
      [   13.172650] 2.6.39-rc6-mmotm0506 #1
      [   13.172773] -------------------------------------------------------
      [   13.172997] blkid/267 is trying to acquire lock:
      [   13.173009]  (&p->pi_lock){-.-.-.}, at: [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
      [   13.173009]
      [   13.173009] but task is already holding lock:
      [   13.173009]  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
      [   13.173009]
      [   13.173009] which lock already depends on the new lock.
      [   13.173009]
      [   13.173009]
      [   13.173009] the existing dependency chain (in reverse order) is:
      [   13.173009]
      [   13.173009] -> #2 (rcu_node_level_0){..-...}:
      [   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
      [   13.173009]        [<ffffffff81090794>] rcu_read_unlock_special+0x8c/0x1d5
      [   13.173009]        [<ffffffff8109092c>] __rcu_read_unlock+0x4f/0xd7
      [   13.173009]        [<ffffffff81027bd3>] rcu_read_unlock+0x21/0x23
      [   13.173009]        [<ffffffff8102cc34>] cpuacct_charge+0x6c/0x75
      [   13.173009]        [<ffffffff81030cc6>] update_curr+0x101/0x12e
      [   13.173009]        [<ffffffff810311d0>] check_preempt_wakeup+0xf7/0x23b
      [   13.173009]        [<ffffffff8102acb3>] check_preempt_curr+0x2b/0x68
      [   13.173009]        [<ffffffff81031d40>] ttwu_do_wakeup+0x76/0x128
      [   13.173009]        [<ffffffff81031e49>] ttwu_do_activate.constprop.63+0x57/0x5c
      [   13.173009]        [<ffffffff81031e96>] scheduler_ipi+0x48/0x5d
      [   13.173009]        [<ffffffff810177d5>] smp_reschedule_interrupt+0x16/0x18
      [   13.173009]        [<ffffffff815710f3>] reschedule_interrupt+0x13/0x20
      [   13.173009]        [<ffffffff810b66d1>] rcu_read_unlock+0x21/0x23
      [   13.173009]        [<ffffffff810b739c>] find_get_page+0xa9/0xb9
      [   13.173009]        [<ffffffff810b8b48>] filemap_fault+0x6a/0x34d
      [   13.173009]        [<ffffffff810d1a25>] __do_fault+0x54/0x3e6
      [   13.173009]        [<ffffffff810d447a>] handle_pte_fault+0x12c/0x1ed
      [   13.173009]        [<ffffffff810d48f7>] handle_mm_fault+0x1cd/0x1e0
      [   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
      [   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
      [   13.173009]
      [   13.173009] -> #1 (&rq->lock){-.-.-.}:
      [   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]        [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45
      [   13.173009]        [<ffffffff81027e19>] __task_rq_lock+0x8b/0xd3
      [   13.173009]        [<ffffffff81032f7f>] wake_up_new_task+0x41/0x108
      [   13.173009]        [<ffffffff810376c3>] do_fork+0x265/0x33f
      [   13.173009]        [<ffffffff81007d02>] kernel_thread+0x6b/0x6d
      [   13.173009]        [<ffffffff8153a9dd>] rest_init+0x21/0xd2
      [   13.173009]        [<ffffffff81b1db4f>] start_kernel+0x3bb/0x3c6
      [   13.173009]        [<ffffffff81b1d29f>] x86_64_start_reservations+0xaf/0xb3
      [   13.173009]        [<ffffffff81b1d393>] x86_64_start_kernel+0xf0/0xf7
      [   13.173009]
      [   13.173009] -> #0 (&p->pi_lock){-.-.-.}:
      [   13.173009]        [<ffffffff81067788>] check_prev_add+0x68/0x20e
      [   13.173009]        [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]        [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]        [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]        [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]        [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
      [   13.173009]        [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
      [   13.173009]        [<ffffffff81032f3c>] wake_up_process+0x10/0x12
      [   13.173009]        [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
      [   13.173009]        [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
      [   13.173009]        [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
      [   13.173009]        [<ffffffff8103e487>] __do_softirq+0x109/0x26a
      [   13.173009]        [<ffffffff8157144c>] call_softirq+0x1c/0x30
      [   13.173009]        [<ffffffff81003207>] do_softirq+0x44/0xf1
      [   13.173009]        [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
      [   13.173009]        [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
      [   13.173009]        [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
      [   13.173009]        [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
      [   13.173009]        [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
      [   13.173009]        [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
      [   13.173009]        [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
      [   13.173009]        [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
      [   13.173009]        [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
      [   13.173009]        [<ffffffff8156a75f>] page_fault+0x1f/0x30
      [   13.173009]
      [   13.173009] other info that might help us debug this:
      [   13.173009]
      [   13.173009] Chain exists of:
      [   13.173009]   &p->pi_lock --> &rq->lock --> rcu_node_level_0
      [   13.173009]
      [   13.173009]  Possible unsafe locking scenario:
      [   13.173009]
      [   13.173009]        CPU0                    CPU1
      [   13.173009]        ----                    ----
      [   13.173009]   lock(rcu_node_level_0);
      [   13.173009]                                lock(&rq->lock);
      [   13.173009]                                lock(rcu_node_level_0);
      [   13.173009]   lock(&p->pi_lock);
      [   13.173009]
      [   13.173009]  *** DEADLOCK ***
      [   13.173009]
      [   13.173009] 3 locks held by blkid/267:
      [   13.173009]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8156cdb4>] do_page_fault+0x1f3/0x5de
      [   13.173009]  #1:  (&yield_timer){+.-...}, at: [<ffffffff810451da>] call_timer_fn+0x0/0x1e9
      [   13.173009]  #2:  (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58
      [   13.173009]
      [   13.173009] stack backtrace:
      [   13.173009] Pid: 267, comm: blkid Not tainted 2.6.39-rc6-mmotm0506 #1
      [   13.173009] Call Trace:
      [   13.173009]  <IRQ>  [<ffffffff8154a529>] print_circular_bug+0xc8/0xd9
      [   13.173009]  [<ffffffff81067788>] check_prev_add+0x68/0x20e
      [   13.173009]  [<ffffffff8100c861>] ? save_stack_trace+0x28/0x46
      [   13.173009]  [<ffffffff810679b9>] check_prevs_add+0x8b/0x104
      [   13.173009]  [<ffffffff81067da1>] validate_chain+0x36f/0x3ab
      [   13.173009]  [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2
      [   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c
      [   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57
      [   13.173009]  [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff81032f3c>] wake_up_process+0x10/0x12
      [   13.173009]  [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff81045286>] call_timer_fn+0xac/0x1e9
      [   13.173009]  [<ffffffff810451da>] ? del_timer+0x75/0x75
      [   13.173009]  [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82
      [   13.173009]  [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2
      [   13.173009]  [<ffffffff8103e487>] __do_softirq+0x109/0x26a
      [   13.173009]  [<ffffffff8106365f>] ? tick_dev_program_event+0x37/0xf6
      [   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
      [   13.173009]  [<ffffffff8157144c>] call_softirq+0x1c/0x30
      [   13.173009]  [<ffffffff81003207>] do_softirq+0x44/0xf1
      [   13.173009]  [<ffffffff8103e8b9>] irq_exit+0x58/0xc8
      [   13.173009]  [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87
      [   13.173009]  [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20
      [   13.173009]  <EOI>  [<ffffffff810bd384>] ? get_page_from_freelist+0x114/0x310
      [   13.173009]  [<ffffffff810bd51a>] ? get_page_from_freelist+0x2aa/0x310
      [   13.173009]  [<ffffffff812220e7>] ? clear_page_c+0x7/0x10
      [   13.173009]  [<ffffffff810bd1ef>] ? prep_new_page+0x14c/0x1cd
      [   13.173009]  [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310
      [   13.173009]  [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243
      [   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
      [   13.173009]  [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a
      [   13.173009]  [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99
      [   13.173009]  [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b
      [   13.173009]  [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0
      [   13.173009]  [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de
      [   13.173009]  [<ffffffff810d915f>] ? sys_brk+0x32/0x10c
      [   13.173009]  [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f
      [   13.173009]  [<ffffffff81065c4f>] ? trace_hardirqs_off_caller+0x3f/0x9c
      [   13.173009]  [<ffffffff812235dd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
      [   13.173009]  [<ffffffff8156a75f>] page_fault+0x1f/0x30
      [   14.010075] usb 5-1: new full speed USB device number 2 using uhci_hcd
      Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8826f3b0
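The lock-free handling of the wakeup mask described in this commit can be illustrated with a minimal userspace sketch. It assumes C11 atomics as a stand-in for the kernel's atomic operations; `wakemask`, `request_wakeup()`, and `collect_wakeups()` are hypothetical names, not the kernel's own code.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the rcu_node ->wakemask field: one bit per CPU. */
static atomic_ulong wakemask;

/* Timer side: record that `cpu` needs its kthread awakened, without a lock. */
static void request_wakeup(unsigned int cpu)
{
    assert(cpu < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_or(&wakemask, 1UL << cpu);
}

/* Consumer side: atomically take every pending bit and clear the mask. */
static unsigned long collect_wakeups(void)
{
    return atomic_exchange(&wakemask, 0UL);
}
```

Because setting a bit and draining the mask are each a single atomic operation, the timer path in this sketch never has to hold a lock while it later wakes a thread, which is the dependency that lockdep flagged.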
    • P
      perf: Fix SIGIO handling · f506b3dc
      Peter Zijlstra committed
      Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
      explicitly push the wakeup (including signals) when requested.
      Reported-by: Vince Weaver <vweaver1@eecs.utk.edu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f506b3dc
    • K
      cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed · 1e1b6c51
      KOSAKI Motohiro committed
      The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
      tsk->cpus_allowed. Otherwise the RT scheduler may get confused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1e1b6c51
    • P
      sched: Fix ->min_vruntime calculation in dequeue_entity() · 1e876231
      Peter Zijlstra committed
      Dima Zavin <dima@android.com> reported:
      
      "After pulling the thread off the run-queue during a cgroup change,
      the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
      then gets normalized to this new value. This can then lead to the thread
      getting an unfair boost in the new group if the vruntime of the next
      task in the old run-queue was way further ahead."
      Reported-by: Dima Zavin <dima@android.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Recalls-having-tested-once-upon-a-time-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1e876231
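The effect Dima describes can be reproduced with a toy model. The sketch below is not the scheduler's code: `struct toy_rq` and the helper names are hypothetical, and it only contrasts two orderings of "recalculate min_vruntime" and "normalize the dequeued task". The commit message does not spell out the fix, so treating "normalize first" as the remedy is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a CFS run-queue. */
struct toy_rq {
    uint64_t min_vruntime;   /* monotonic floor for the queue */
};

/* Advance the floor to the smallest vruntime still queued; never move it back. */
static void toy_update_min_vruntime(struct toy_rq *rq, uint64_t leftmost)
{
    assert(rq);
    if ((int64_t)(leftmost - rq->min_vruntime) > 0)
        rq->min_vruntime = leftmost;
}

/* Queue-relative vruntime that the task carries to its next run-queue. */
static int64_t toy_normalize(const struct toy_rq *rq, uint64_t vruntime)
{
    return (int64_t)(vruntime - rq->min_vruntime);
}

/* Ordering from the report: recalculate the floor, then normalize. */
static int64_t dequeue_update_first(struct toy_rq *rq, uint64_t vruntime,
                                    uint64_t next_leftmost)
{
    toy_update_min_vruntime(rq, next_leftmost);
    return toy_normalize(rq, vruntime);
}

/* Alternative ordering: normalize against the old floor, then recalculate. */
static int64_t dequeue_normalize_first(struct toy_rq *rq, uint64_t vruntime,
                                       uint64_t next_leftmost)
{
    int64_t rel = toy_normalize(rq, vruntime);

    toy_update_min_vruntime(rq, next_leftmost);
    return rel;
}
```

With a floor of 1000, a departing task at 1000, and the next task at 5000, the first ordering yields a relative vruntime of -4000 (a large head start in the new group), while the second yields 0.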
    • P
      sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW · d6aa8f85
      Peter Zijlstra committed
      Marc reported that e4a52bcb (sched: Remove rq->lock from the first
      half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
      __ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
      code was suspect.
      
      Yong found that the interrupt could hit after context_switch() changes
      current but before it clears p->on_cpu.  If that interrupt were to
      attempt a wake-up of p, we would indeed find ourselves spinning in IRQ
      context.
      
      Fix this by reverting to the old behaviour for this situation and
      perform a full remote wake-up.
      
      Cc: Frank Rowand <frank.rowand@am.sony.com>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Marc Zyngier <Marc.Zyngier@arm.com>
      Tested-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d6aa8f85
    • X
      sched: More sched_domain iterations fixes · cd4ae6ad
      Xiaotian Feng committed
      sched_domain iterations need to be protected by rcu_read_lock() now.
      This patch adds the lock in two more places that need it, which were
      spotted by the following suspicious rcu_dereference_check() usage warnings.
      
      kernel/sched_rt.c:1244 invoked rcu_dereference_check() without protection!
      kernel/sched_stats.h:41 invoked rcu_dereference_check() without protection!
      Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1303469634-11678-1-git-send-email-dfeng@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cd4ae6ad
  2. 27 May, 2011 13 commits
    • R
      kernel/profile.c: remove some duplicate code from profile_hits() · 6f7bd76f
      Rakib Mullick committed
      profile_hits() has a common check for prof_on and prof_buffer regardless
      of SMP or !SMP.  So, remove some duplicate code by splitting profile_hits
      into two.
      
      [akpm@linux-foundation.org: make do_profile_hits static]
      Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f7bd76f
    • J
      mm: extract exe_file handling from procfs · 38646013
      Jiri Slaby committed
      Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
      This was because exe_file was needed only for /proc/<pid>/exe.  Since we
      will also need the exe_file functionality for core dumps (so that the core
      name can contain the full binary path), build this functionality into the
      kernel unconditionally.
      
      To achieve that, move it out of procfs into kernel/, where in fact it
      should belong.  By doing that we can make dup_mm_exe_file static.  Also we
      can drop the linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38646013
    • D
      cgroup: remove the ns_cgroup · a77aea92
      Daniel Lezcano committed
      The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
      leads to some problems:
      
        * cgroup creation is out-of-control
        * cgroup name can conflict when pids are looping
        * it is not possible to have a single process handling a lot of
          namespaces without falling into exponential creation time
        * we may want to create a namespace without creating a cgroup
      
        The ns_cgroup was replaced by a compatibility flag 'clone_children',
        where a newly created cgroup will copy the parent cgroup values.
        The userspace has to manually create a cgroup and add a task to
        the 'tasks' file.
      
      This patch removes the ns_cgroup as suggested in the following thread:
      
      https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
      
      The 'cgroup_clone' function is removed because it is no longer used.
      
      This is a userspace-visible change.  Commit 45531757 ("cgroup: notify
      ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
      printk warning users that the feature is planned for removal.  Since that
      time we have heard from XXX users who were affected by this.
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a77aea92
    • B
      cgroups: use flex_array in attach_proc · d846687d
      Ben Blum committed
      Convert cgroup_attach_proc to use flex_array.
      
      The cgroup_attach_proc implementation requires a pre-allocated array to
      store task pointers to atomically move a thread-group, but asking for a
      monolithic array with kmalloc() may be unreliable for very large groups.
      Using flex_array provides the same functionality with less risk of
      failure.
      
      This is a post-patch for cgroup-procs-write.patch.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d846687d
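The idea behind swapping one large kmalloc() for a flex_array can be sketched in userspace C. This is an illustration of the general technique (many small, lazily allocated chunks instead of one monolithic buffer), not the kernel's flex_array API; `struct chunked_array`, `CHUNK_SLOTS`, and the `chunked_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define CHUNK_SLOTS 64   /* hypothetical chunk size, chosen only for the sketch */

struct chunked_array {
    size_t nr_chunks;
    void ***chunks;      /* chunks[i] is allocated lazily, on first store */
};

/* Reserve room for `total` pointers; only the small chunk index is allocated. */
static struct chunked_array *chunked_alloc(size_t total)
{
    struct chunked_array *a = malloc(sizeof(*a));

    if (!a)
        return NULL;
    a->nr_chunks = (total + CHUNK_SLOTS - 1) / CHUNK_SLOTS;
    a->chunks = calloc(a->nr_chunks ? a->nr_chunks : 1, sizeof(*a->chunks));
    if (!a->chunks) {
        free(a);
        return NULL;
    }
    return a;
}

/* Store a pointer, allocating the owning chunk on demand. Returns 0 or -1. */
static int chunked_put(struct chunked_array *a, size_t idx, void *ptr)
{
    size_t c = idx / CHUNK_SLOTS;

    assert(a);
    if (c >= a->nr_chunks)
        return -1;
    if (!a->chunks[c]) {
        a->chunks[c] = calloc(CHUNK_SLOTS, sizeof(void *));
        if (!a->chunks[c])
            return -1;
    }
    a->chunks[c][idx % CHUNK_SLOTS] = ptr;
    return 0;
}

/* Fetch a pointer; slots that were never stored read back as NULL. */
static void *chunked_get(const struct chunked_array *a, size_t idx)
{
    size_t c = idx / CHUNK_SLOTS;

    if (c >= a->nr_chunks || !a->chunks[c])
        return NULL;
    return a->chunks[c][idx % CHUNK_SLOTS];
}

static void chunked_free(struct chunked_array *a)
{
    for (size_t i = 0; i < a->nr_chunks; i++)
        free(a->chunks[i]);
    free(a->chunks);
    free(a);
}
```

Each allocation stays small regardless of the group size, so a very large thread group does not depend on finding one big contiguous block of memory.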
    • B
      cgroups: make procs file writable · 74a1166d
      Ben Blum committed
      Make procs file writable to move all threads by tgid at once.
      
      Add functionality that enables users to move all threads in a threadgroup
      at once to a cgroup by writing the tgid to the 'cgroup.procs' file.  This
      current implementation makes use of a per-threadgroup rwsem that's taken
      for reading in the fork() path to prevent newly forking threads within the
      threadgroup from "escaping" while the move is in progress.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74a1166d
    • B
      cgroups: add per-thread subsystem callbacks · f780bdb7
      Ben Blum committed
      Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
      
      Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
      for the cgroups subsystem interface.  Unlike can_attach and attach, these
      are for per-thread operations, to be called potentially many times when
      attaching an entire threadgroup.
      
      Also, the old "bool threadgroup" interface is removed, as replaced by
      this.  All subsystems are modified for the new interface - of note is
      cpuset, which requires from/to nodemasks for attach to be globally scoped
      (though per-cpuset would work too) to persist from its pre_attach to
      attach_task and attach.
      
      This is a pre-patch for cgroup-procs-writable.patch.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f780bdb7
    • B
      cgroups: read-write lock CLONE_THREAD forking per threadgroup · 4714d1d3
      Ben Blum committed
      Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
      
      Add an rwsem that lives in a threadgroup's signal_struct that's taken for
      reading in the fork path, under CONFIG_CGROUPS.  If another part of the
      kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
      ifdefs should be changed to a higher-up flag that CGROUPS and the other
      system would both depend on.
      
      This is a pre-patch for cgroup-procs-write.patch.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4714d1d3
    • R
      PM: Fix PM QOS's user mode interface to work with ASCII input · 0775a60a
      Rafael J. Wysocki committed
      Make pm_qos_power_write() accept values passed to it in the ASCII hex
      format either with or without an ending newline.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Mark Gross <markgross@thegnar.org>
      0775a60a
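Accepting an ASCII hex value with or without a trailing newline can be sketched as follows. This is a userspace illustration built on the standard `strtoul()`, not the body of pm_qos_power_write(); the function name `parse_hex_s32()` and the 16-byte scratch buffer are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse an ASCII hex value such as "0x1f" or "1f\n".
 * Returns 0 on success and stores the value in *out, or -1 on bad input.
 */
static int parse_hex_s32(const char *buf, size_t len, int32_t *out)
{
    char tmp[16];
    char *end;
    unsigned long v;

    assert(buf && out);

    /* Tolerate exactly one trailing newline, as written by `echo`. */
    if (len > 0 && buf[len - 1] == '\n')
        len--;
    if (len == 0 || len >= sizeof(tmp))
        return -1;

    /* Work on a NUL-terminated copy; the caller's buffer may not be one. */
    memcpy(tmp, buf, len);
    tmp[len] = '\0';

    errno = 0;
    v = strtoul(tmp, &end, 16);   /* base 16 also accepts a "0x" prefix */
    if (errno || end == tmp || *end != '\0' || v > 0xffffffffUL)
        return -1;

    *out = (int32_t)(uint32_t)v;
    return 0;
}
```

Stripping the newline before parsing means that `echo 1f > file` and a write of the bare two characters are treated the same way.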
    • P
      rcu: Decrease memory-barrier usage based on semi-formal proof · 23b5c8fa
      Paul E. McKenney committed
      (Note: this was reverted, and is now being re-applied in pieces, with
      this being the fifth and final piece.  See below for the reason that
      it is now felt to be safe to re-apply this.)
      
      Commit d09b62df fixed grace-period synchronization, but left some smp_mb()
      invocations in rcu_process_callbacks() that are no longer needed, but
      sheer paranoia prevented them from being removed.  This commit removes
      them and provides a proof of correctness in their absence.  It also adds
      a memory barrier to rcu_report_qs_rsp() immediately before the update to
      rsp->completed in order to handle the theoretical possibility that the
      compiler or CPU might move massive quantities of code into a lock-based
      critical section.  This also proves that the sheer paranoia was not
      entirely unjustified, at least from a theoretical point of view.
      
      In addition, the old dyntick-idle synchronization depended on the fact
      that grace periods were many milliseconds in duration, so that it could
      be assumed that no dyntick-idle CPU could reorder a memory reference
      across an entire grace period.  Unfortunately for this design, the
      addition of expedited grace periods breaks this assumption, which has
      the unfortunate side-effect of requiring atomic operations in the
      functions that track dyntick-idle state for RCU.  (There is some hope
      that the algorithms used in user-level RCU might be applied here, but
      some work is required to handle the NMIs that user-space applications
      can happily ignore.  For the short term, better safe than sorry.)
      
      This proof assumes that neither compiler nor CPU will allow a lock
      acquisition and release to be reordered, as doing so can result in
      deadlock.  The proof is as follows:
      
      1.	A given CPU declares a quiescent state under the protection of
      	its leaf rcu_node's lock.
      
      2.	If there is more than one level of rcu_node hierarchy, the
      	last CPU to declare a quiescent state will also acquire the
      	->lock of the next rcu_node up in the hierarchy,  but only
      	after releasing the lower level's lock.  The acquisition of this
      	lock clearly cannot occur prior to the acquisition of the leaf
      	node's lock.
      
      3.	Step 2 repeats until we reach the root rcu_node structure.
      	Please note again that only one lock is held at a time through
      	this process.  The acquisition of the root rcu_node's ->lock
      	must occur after the release of that of the leaf rcu_node.
      
      4.	At this point, we set the ->completed field in the rcu_state
      	structure in rcu_report_qs_rsp().  However, if the rcu_node
      	hierarchy contains only one rcu_node, then in theory the code
      	preceding the quiescent state could leak into the critical
      	section.  We therefore precede the update of ->completed with a
      	memory barrier.  All CPUs will therefore agree that any updates
      	preceding any report of a quiescent state will have happened
      	before the update of ->completed.
      
      5.	Regardless of whether a new grace period is needed, rcu_start_gp()
      	will propagate the new value of ->completed to all of the leaf
      	rcu_node structures, under the protection of each rcu_node's ->lock.
      	If a new grace period is needed immediately, this propagation
      	will occur in the same critical section that ->completed was
      	set in, but courtesy of the memory barrier in #4 above, is still
      	seen to follow any pre-quiescent-state activity.
      
      6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
      	aware of the end of the old grace period and therefore makes
      	any RCU callbacks that were waiting on that grace period eligible
      	for invocation.
      
      	If this CPU is the same one that detected the end of the grace
      	period, and if there is but a single rcu_node in the hierarchy,
      	we will still be in the single critical section.  In this case,
      	the memory barrier in step #4 guarantees that all callbacks will
      	be seen to execute after each CPU's quiescent state.
      
      	On the other hand, if this is a different CPU, it will acquire
      	the leaf rcu_node's ->lock, and will again be serialized after
      	each CPU's quiescent state for the old grace period.
      
      On the strength of this proof, this commit therefore removes the memory
      barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
      The effect is to reduce the number of memory barriers by one and to
      reduce the frequency of execution from about once per scheduling tick
      per CPU to once per grace period.
      
      This was reverted due to hangs found during testing by Yinghai Lu and
      Ingo Molnar.  Frederic Weisbecker supplied Yinghai with tracing that
      located the underlying problem, and Frederic also provided the fix.
      
      The underlying problem was that the HARDIRQ_ENTER() macro from
      lib/locking-selftest.c invoked irq_enter(), which in turn invokes
      rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
      does not invoke rcu_irq_exit().  This situation resulted in calls
      to rcu_irq_enter() that were not balanced by the required calls to
      rcu_irq_exit().  Therefore, after these locking selftests completed,
      RCU's dyntick-idle nesting count was a large number (for example,
      72), which caused RCU to conclude that the affected CPU was not in
      dyntick-idle mode when in fact it was.
      
      RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
      in hangs.
      
      In contrast, with Frederic's patch, which replaces the irq_enter()
      in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
      either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
      running the test is already marked as not being in dyntick-idle mode.
      This means that the rcu_irq_enter() and rcu_irq_exit() calls remain
      balanced, and RCU then has no problem working out which CPUs are in
      dyntick-idle mode and which are not.
      
      The reason that the imbalance was not noticed before the barrier patch
      was applied is that the old implementation of rcu_enter_nohz() ignored
      the nesting depth.  This could still result in delays, but much shorter
      ones.  Whenever there was a delay, RCU would IPI the CPU with the
      unbalanced nesting level, which would eventually result in rcu_enter_nohz()
      being called, which in turn would force RCU to see that the CPU was in
      dyntick-idle mode.
      
      The reason that very few people noticed the problem is that the mismatched
      irq_enter() vs. __irq_exit() occurred only when the kernel was built with
      CONFIG_DEBUG_LOCKING_API_SELFTESTS.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      23b5c8fa
    • P
      rcu: Make rcu_enter_nohz() pay attention to nesting · 4305ce78
      Paul E. McKenney authored
      The old version of rcu_enter_nohz() forced RCU into nohz mode even if
      the nesting count was non-zero.  This change causes rcu_enter_nohz()
      to hold off for non-zero nesting counts.
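      
      The behavioral difference can be sketched in user space as follows.
      This is a toy model under stated assumptions: the struct and the
      toy_* functions are hypothetical, and the model only checks the
      nesting count rather than reproducing how the kernel maintains it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for the sketch. */
struct toy_dynticks {
    int nesting;  /* outstanding reasons for the CPU to be non-idle */
    bool in_nohz; /* whether RCU treats the CPU as being in nohz mode */
};

/* Old behavior: force nohz mode regardless of the nesting count. */
void toy_enter_nohz_old(struct toy_dynticks *d)
{
    d->in_nohz = true;
}

/*
 * New behavior: hold off while the nesting count is nonzero.
 * Returns true if nohz mode was actually entered.
 */
bool toy_enter_nohz_new(struct toy_dynticks *d)
{
    assert(d->nesting >= 0);
    if (d->nesting != 0)
        return false;
    d->in_nohz = true;
    return true;
}
```

      With a leaked nesting count, the old variant still marks the CPU as
      idle, whereas the new variant refuses until the count is zero.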
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      4305ce78
    • P
      rcu: Don't do reschedule unless in irq · b5904090
      Paul E. McKenney authored
      Condition the set_need_resched() in rcu_irq_exit() on in_irq().  This
      should be a no-op, because rcu_irq_exit() should only be called from irq.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b5904090
    • P
      rcu: Remove old memory barriers from rcu_process_callbacks() · 1135633b
      Paul E. McKenney authored
      Second step of partitioning of commit e59fb312.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      1135633b
    • P
      rcu: Add memory barriers · 0bbcc529
      Paul E. McKenney authored
      Add the memory barriers added by e59fb312.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      0bbcc529
  3. 26 May, 2011 10 commits
  4. 25 May, 2011 8 commits
    • M
      printk: allocate kernel log buffer earlier · 162a7e75
      Mike Travis authored
      On larger systems, because of the numerous ACPI, Bootmem and EFI messages,
      the static log buffer overflows before the larger one specified by the
      log_buf_len param is allocated.  Minimize the overflow by allocating the
      new log buffer as soon as possible.
      
      On kernels without memblock, a later call to setup_log_buf from
      kernel/init.c is the fallback.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_PRINTK=n build]
      Signed-off-by: Mike Travis <travis@sgi.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      162a7e75
    • M
      bitmap, irq: add smp_affinity_list interface to /proc/irq · 4b060420
      Mike Travis authored
      Manually adjusting the smp_affinity for IRQs becomes unwieldy when the
      CPU count is large.
      
      Setting smp affinity to cpus 256 to 263 would be:
      
      	echo 000000ff,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 > smp_affinity
      
      instead of:
      
      	echo 256-263 > smp_affinity_list
      
      Think about what it looks like for cpus around say, 4088 to 4095.
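      
      To make the size of the map form concrete, here is a minimal
      user-space sketch that formats a CPU range as comma-separated 32-bit
      hex words, most significant word first.  The function name
      toy_cpu_range_to_mask() is hypothetical and is not a kernel or libc
      interface; it only illustrates the arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the mask for CPUs [first, last] into buf as comma-separated
 * 32-bit hex words, most significant word first.
 * Returns the string length on success, or -1 if buf is too small.
 */
int toy_cpu_range_to_mask(unsigned first, unsigned last, char *buf, size_t len)
{
    size_t used = 0;

    assert(first <= last);
    for (unsigned w = last / 32 + 1; w-- > 0; ) {
        uint32_t word = 0;
        char piece[10]; /* 8 hex digits, optional comma, NUL */

        /* Collect the bits of this 32-CPU word that fall in the range. */
        for (unsigned bit = 0; bit < 32; bit++) {
            unsigned cpu = w * 32 + bit;
            if (cpu >= first && cpu <= last)
                word |= UINT32_C(1) << bit;
        }
        int n = snprintf(piece, sizeof piece, "%08lx%s",
                         (unsigned long)word, w ? "," : "");
        if (n < 0 || used + (size_t)n + 1 > len)
            return -1;
        memcpy(buf + used, piece, (size_t)n + 1);
        used += (size_t)n;
    }
    return (int)used;
}
```

      Each additional 32 CPUs below the highest one in the range costs
      another nine characters of map, while the list form stays at
      "first-last".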
      
      We already have many alternate "list" interfaces:
      
      /sys/devices/system/cpu/cpuX/indexY/shared_cpu_list
      /sys/devices/system/cpu/cpuX/topology/thread_siblings_list
      /sys/devices/system/cpu/cpuX/topology/core_siblings_list
      /sys/devices/system/node/nodeX/cpulist
      /sys/devices/pci***/***/local_cpulist
      
      Add a companion interface, smp_affinity_list, to use cpu lists instead of
      cpu maps.  This conforms to other companion interfaces where both a map
      and a list interface exist.
      
      This required adding a bitmap_parselist_user() function in a manner
      similar to the bitmap_parse_user() function.
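      
      The list syntax itself is simple to parse.  The following user-space
      sketch accepts comma-separated numbers and ranges such as "0,2-4".
      It is not the kernel's bitmap_parselist_user(): the toy_* names and
      the 512-bit limit are assumptions made for illustration, and on
      failure the bitmap may be left partially filled.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NBITS 512
#define TOY_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define TOY_NWORDS ((TOY_NBITS + TOY_BITS_PER_WORD - 1) / TOY_BITS_PER_WORD)

struct toy_bitmap {
    unsigned long w[TOY_NWORDS];
};

void toy_set_bit(struct toy_bitmap *m, unsigned bit)
{
    assert(bit < TOY_NBITS);
    m->w[bit / TOY_BITS_PER_WORD] |= 1UL << (bit % TOY_BITS_PER_WORD);
}

int toy_test_bit(const struct toy_bitmap *m, unsigned bit)
{
    assert(bit < TOY_NBITS);
    return (int)((m->w[bit / TOY_BITS_PER_WORD] >> (bit % TOY_BITS_PER_WORD)) & 1UL);
}

/* Parse "a", "a-b", or a comma-separated mix.  Returns 0, or -1 if malformed. */
int toy_parselist(const char *s, struct toy_bitmap *m)
{
    memset(m, 0, sizeof *m);
    while (*s) {
        char *end;
        unsigned long a, b;

        if (!isdigit((unsigned char)*s))
            return -1;
        a = b = strtoul(s, &end, 10);
        if (*end == '-') {
            s = end + 1;
            if (!isdigit((unsigned char)*s))
                return -1;
            b = strtoul(s, &end, 10);
        }
        if (a > b || b >= TOY_NBITS)
            return -1;
        for (unsigned long i = a; i <= b; i++)
            toy_set_bit(m, (unsigned)i);
        if (*end == ',') {
            end++;
            if (*end == '\0')
                return -1; /* reject a trailing comma */
        } else if (*end != '\0') {
            return -1;
        }
        s = end;
    }
    return 0;
}
```

      For example, parsing "256-263" sets exactly bits 256 through 263.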
      
      [akpm@linux-foundation.org: make __bitmap_parselist() static]
      Signed-off-by: Mike Travis <travis@sgi.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b060420
    • K
      mm: convert mm->cpu_vm_cpumask into cpumask_var_t · de03c72c
      KOSAKI Motohiro authored
      cpumask_t is a very big struct, and cpu_vm_mask is placed in the wrong
      position.  This might reduce the cache hit ratio.
      
      This patch makes two changes:
      1) Move the cpumask to the end of mm_struct, because usually only the
         front bits of the cpumask are accessed when the system has
         cpu-hotplug capability.
      2) Convert cpu_vm_mask into cpumask_var_t.  This may help reduce the
         memory footprint if cpumask_size() uses nr_cpumask_bits properly in
         the future.
      
      In addition, this patch renames cpu_vm_mask to cpu_vm_mask_var.
      This may help detect out-of-tree cpu_vm_mask users.
      
      This patch makes no functional change.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de03c72c
    • P
      mm: Convert i_mmap_lock to a mutex · 3d48ae45
      Peter Zijlstra authored
      Straightforward conversion of i_mmap_lock to a mutex.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d48ae45
    • P
      mm: Remove i_mmap_lock lockbreak · 97a89413
      Peter Zijlstra authored
      Hugh says:
       "The only significant loser, I think, would be page reclaim (when
        concurrent with truncation): could spin for a long time waiting for
        the i_mmap_mutex it expects would soon be dropped?"
      
      Counterpoints:
       - cpu contention makes the spin stop (need_resched())
       - zap pages should be freeing pages at a higher rate than reclaim
         ever can
      
      I think the simplification of the truncate code is definitely worth it.
      
      Effectively reverts: 2aa15890 ("mm: prevent concurrent
      unmap_mapping_range() on the same inode") and takes out the code that
      caused its problem.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a89413
    • P
      lockdep, mutex: provide mutex_lock_nest_lock · e4c70a66
      Peter Zijlstra authored
      In order to convert i_mmap_lock to a mutex we need a mutex equivalent to
      spin_lock_nest_lock(), thus provide the mutex_lock_nest_lock() annotation.
      
      As with spin_lock_nest_lock(), mutex_lock_nest_lock() allows annotation of
      the locking pattern where an outer lock serializes the acquisition order
      of nested locks.  That is, if every time you lock multiple A locks, say
      A1 and A2, you first acquire N, then the order of acquiring A1 and A2
      is irrelevant.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4c70a66
    • R
      PM / Hibernate: Update kerneldoc comments in hibernate.c · f42a9813
      Rafael J. Wysocki authored
      Some of the kerneldoc comments in kernel/power/hibernate.c are
      outdated and some of them don't adhere to the kernel's standards.
      Update them and make them consistent.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
      f42a9813
    • R
      PM / Hibernate: Remove arch_prepare_suspend() · 35425801
      Rafael J. Wysocki authored
      All architectures supporting hibernation define
      arch_prepare_suspend() as an empty function, so remove it.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      35425801
  5. 24 May, 2011 1 commit
    • E
      posix-timers: RCU conversion · 8af08871
      Eric Dumazet authored
      Ben Nagy reported a scalability problem with KVM/QEMU that hit a single
      spinlock (idr_lock) in the posix-timers code very hard on a 48-core
      machine.
      
      Even on a 16-CPU machine (2x4x2), a single test can show 98% of CPU time
      spent in ticket_spin_lock, called from lock_timer.
      
      Ref: http://www.spinics.net/lists/kvm/msg51526.html
      
      Switching to RCU is quite easy, IDR being already RCU ready. idr_lock
      should be locked only for an insert/delete, not a lookup.
      
      Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime().
      
      Before :
      
      real    1m18.669s
      user    0m1.346s
      sys     1m17.180s
      
      After :
      
      real    0m3.296s
      user    0m1.366s
      sys     0m1.926s
      Reported-by: Ben Nagy <ben@iagu.net>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Tested-by: Ben Nagy <ben@iagu.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      Cc: Richard Cochran <richard.cochran@omicron.at>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8af08871