1. 29 11月, 2013 2 次提交
  2. 28 11月, 2013 2 次提交
    • T
      cgroup: fix cgroup_subsys_state leak for seq_files · e605b365
      Tejun Heo 提交于
      If a cgroup file implements either read_map() or read_seq_string(),
      such file is served using seq_file by overriding file->f_op to
      cgroup_seqfile_operations, which also overrides the release method to
      single_release() from cgroup_file_release().
      
      Because cgroup_file_open() didn't use to acquire any resources, this
      used to be fine, but since f7d58818 ("cgroup: pin
      cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
      pins the css (cgroup_subsys_state) which is put by
      cgroup_file_release().  The patch forgot to update the release path
      for seq_files and each open/release cycle leaks a css reference.
      
      Fix it by updating cgroup_file_release() to also handle seq_files and
      using it for seq_file release path too.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.12
      e605b365
    • P
      cpuset: Fix memory allocator deadlock · 0fc0287c
      Peter Zijlstra 提交于
      Juri hit the below lockdep report:
      
      [    4.303391] ======================================================
      [    4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
      [    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
      [    4.303395] ------------------------------------------------------
      [    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      [    4.303399]  (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290
      [    4.303417]
      [    4.303417] and this task is already holding:
      [    4.303418]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100
      [    4.303431] which would create a new lock dependency:
      [    4.303432]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
      [    4.303436]
      
      [    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
      [    4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
      [    4.303922]    HARDIRQ-ON-W at:
      [    4.303923]                     [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0
      [    4.303926]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303929]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303931]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303933]    SOFTIRQ-ON-W at:
      [    4.303933]                     [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0
      [    4.303935]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303940]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303955]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303959]    INITIAL USE at:
      [    4.303960]                    [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0
      [    4.303963]                    [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303966]                    [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303969]                    [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303972]  }
      
      Which reports that we take mems_allowed_seq with interrupts enabled. A
      little digging found that this can only be from
      cpuset_change_task_nodemask().
      
      This is an actual deadlock because an interrupt doing an allocation will
      hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
      forever waiting for the write side to complete.
      
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reported-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Tested-by: NJuri Lelli <juri.lelli@gmail.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      0fc0287c
  3. 27 11月, 2013 1 次提交
  4. 26 11月, 2013 2 次提交
    • S
      ftrace: Fix function graph with loading of modules · 8a56d776
      Steven Rostedt (Red Hat) 提交于
      Commit 8c4f3c3f "ftrace: Check module functions being traced on reload"
      fixed module loading and unloading with respect to function tracing, but
      it missed the function graph tracer. If you perform the following
      
       # cd /sys/kernel/debug/tracing
       # echo function_graph > current_tracer
       # modprobe nfsd
       # echo nop > current_tracer
      
      You'll get the following oops message:
      
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9()
       Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt
       CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
        0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000
        0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668
        ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000
       Call Trace:
        [<ffffffff814fe193>] dump_stack+0x4f/0x7c
        [<ffffffff8103b80a>] warn_slowpath_common+0x81/0x9b
        [<ffffffff810b2b9a>] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9
        [<ffffffff8103b83e>] warn_slowpath_null+0x1a/0x1c
        [<ffffffff810b2b9a>] __ftrace_hash_rec_update.part.35+0x168/0x1b9
        [<ffffffff81502f89>] ? __mutex_lock_slowpath+0x364/0x364
        [<ffffffff810b2cc2>] ftrace_shutdown+0xd7/0x12b
        [<ffffffff810b47f0>] unregister_ftrace_graph+0x49/0x78
        [<ffffffff810c4b30>] graph_trace_reset+0xe/0x10
        [<ffffffff810bf393>] tracing_set_tracer+0xa7/0x26a
        [<ffffffff810bf5e1>] tracing_set_trace_write+0x8b/0xbd
        [<ffffffff810c501c>] ? ftrace_return_to_handler+0xb2/0xde
        [<ffffffff811240a8>] ? __sb_end_write+0x5e/0x5e
        [<ffffffff81122aed>] vfs_write+0xab/0xf6
        [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
        [<ffffffff81122dbd>] SyS_write+0x59/0x82
        [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
        [<ffffffff8150a2d2>] system_call_fastpath+0x16/0x1b
       ---[ end trace 940358030751eafb ]---
      
      The above mentioned commit didn't go far enough. Well, it covered the
      function tracer by adding checks in __register_ftrace_function(). The
      problem is that the function graph tracer circumvents that (for a slight
      efficiency gain when function graph trace is running with a function
      tracer. The gain was not worth this).
      
      The problem came with ftrace_startup() which should always be called after
      __register_ftrace_function(), if you want this bug to be completely fixed.
      
      Anyway, this solution moves __register_ftrace_function() inside of
      ftrace_startup() and removes the need to call them both.
      Reported-by: NDave Wysochanski <dwysocha@redhat.com>
      Fixes: ed926f9b ("ftrace: Use counters to enable functions to trace")
      Cc: stable@vger.kernel.org # 3.0+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      8a56d776
    • L
      irq: Enable all irqs unconditionally in irq_resume · ac01810c
      Laxman Dewangan 提交于
      When the system enters suspend, it disables all interrupts in
      suspend_device_irqs(), including the interrupts marked EARLY_RESUME.
      
      On the resume side things are different. The EARLY_RESUME interrupts
      are reenabled in sys_core_ops->resume and the non EARLY_RESUME
      interrupts are reenabled in the normal system resume path.
      
      When suspend_noirq() failed or suspend is aborted for any other
      reason, we might omit the resume side call to sys_core_ops->resume()
      and therefor the interrupts marked EARLY_RESUME are not reenabled and
      stay disabled forever.
      
      To solve this, enable all irqs unconditionally in irq_resume()
      regardless whether interrupts marked EARLY_RESUMEhave been already
      enabled or not.
      
      This might try to reenable already enabled interrupts in the non
      failure case, but the only affected platform is XEN and it has been
      confirmed that it does not cause any side effects.
      
      [ tglx: Massaged changelog. ]
      Signed-off-by: NLaxman Dewangan <ldewangan@nvidia.com>
      Acked-by-and-tested-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: NHeiko Stuebner <heiko@sntech.de>
      Reviewed-by: NPavel Machek <pavel@ucw.cz>
      Cc: <ian.campbell@citrix.com>
      Cc: <rjw@rjwysocki.net>
      Cc: <len.brown@intel.com>
      Cc: <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1385388587-16442-1-git-send-email-ldewangan@nvidia.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      ac01810c
  5. 23 11月, 2013 6 次提交
    • L
      workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in init_workqueues · 4e8b22bd
      Li Bin 提交于
      When one work starts execution, the high bits of work's data contain
      pool ID. It can represent a maximum of WORK_OFFQ_POOL_NONE. Pool ID
      is assigned WORK_OFFQ_POOL_NONE when the work being initialized
      indicating that no pool is associated and get_work_pool() uses it to
      check the associated pool. So if worker_pool_assign_id() assigns a
      ID greater than or equal WORK_OFFQ_POOL_NONE to a pool, it triggers
      leakage, and it may break the non-reentrance guarantee.
      
      This patch fix this issue by modifying the worker_pool_assign_id()
      function calling idr_alloc() by setting @end param WORK_OFFQ_POOL_NONE.
      
      Furthermore, in the current implementation, the BUILD_BUG_ON() in
      init_workqueues makes no sense. The number of worker pools needed
      cannot be determined at compile time, because the number of backing
      pools for UNBOUND workqueues is dynamic based on the assigned custom
      attributes. So remove it.
      
      tj: Minor comment and indentation updates.
      Signed-off-by: NLi Bin <huawei.libin@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      4e8b22bd
    • L
      workqueue: fix comment typo for __queue_work() · 9ef28a73
      Li Bin 提交于
      It seems the "dying" should be "draining" here.
      Signed-off-by: NLi Bin <huawei.libin@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      9ef28a73
    • T
      workqueue: fix ordered workqueues in NUMA setups · 8a2b7538
      Tejun Heo 提交于
      An ordered workqueue implements execution ordering by using single
      pool_workqueue with max_active == 1.  On a given pool_workqueue, work
      items are processed in FIFO order and limiting max_active to 1
      enforces the queued work items to be processed one by one.
      
      Unfortunately, 4c16bd32 ("workqueue: implement NUMA affinity for
      unbound workqueues") accidentally broke this guarantee by applying
      NUMA affinity to ordered workqueues too.  On NUMA setups, an ordered
      workqueue would end up with separate pool_workqueues for different
      nodes.  Each pool_workqueue still limits max_active to 1 but multiple
      work items may be executed concurrently and out of order depending on
      which node they are queued to.
      
      Fix it by using dedicated ordered_wq_attrs[] when creating ordered
      workqueues.  The new attrs match the unbound ones except that no_numa
      is always set thus forcing all NUMA nodes to share the default
      pool_workqueue.
      
      While at it, add sanity check in workqueue creation path which
      verifies that an ordered workqueues has only the default
      pool_workqueue.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NLibin <huawei.libin@huawei.com>
      Cc: stable@vger.kernel.org
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      8a2b7538
    • O
      workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY · 91151228
      Oleg Nesterov 提交于
      Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed()
      in create_worker(). Otherwise userland can change ->cpus_allowed
      in between.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      91151228
    • T
      cgroup: use a dedicated workqueue for cgroup destruction · e5fca243
      Tejun Heo 提交于
      Since be445626 ("cgroup: remove synchronize_rcu() from
      cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
      freeing is performed from a work item from that point on and a later
      commit, ea15f8cc ("cgroup: split cgroup destruction into two
      steps"), moves css offlining to workqueue too.
      
      As cgroup destruction isn't depended upon for memory reclaim, the
      destruction work items were put on the system_wq; unfortunately, some
      controller may block in the destruction path for considerable duration
      while holding cgroup_mutex.  As large part of destruction path is
      synchronized through cgroup_mutex, when combined with high rate of
      cgroup removals, this has potential to fill up system_wq's max_active
      of 256.
      
      Also, it turns out that memcg's css destruction path ends up queueing
      and waiting for work items on system_wq through work_on_cpu().  If
      such operation happens while system_wq is fully occupied by cgroup
      destruction work items, work_on_cpu() can't make forward progress
      because system_wq is full and other destruction work items on
      system_wq can't make forward progress because the work item waiting
      for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
      
      This can be fixed by queueing destruction work items on a separate
      workqueue.  This patch creates a dedicated workqueue -
      cgroup_destroy_wq - for this purpose.  As these work items shouldn't
      have inter-dependencies and mostly serialized by cgroup_mutex anyway,
      giving high concurrency level doesn't buy anything and the workqueue's
      @max_active is set to 1 so that destruction work items are executed
      one by one on each CPU.
      
      Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
      cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
      separate core_initcall().  In the future, we probably want to reorder
      so that workqueue init happens before cgroup_init().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NHugh Dickins <hughd@google.com>
      Reported-by: NShawn Bohrer <shawn.bohrer@gmail.com>
      Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
      Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
      Cc: stable@vger.kernel.org # v3.9+
      e5fca243
    • M
      time: Fix 1ns/tick drift w/ GENERIC_TIME_VSYSCALL_OLD · 4be77398
      Martin Schwidefsky 提交于
      Since commit 1e75fa8b (time: Condense timekeeper.xtime
      into xtime_sec - merged in v3.6), there has been an problem
      with the error accounting in the timekeeping code, such that
      when truncating to nanoseconds, we round up to the next nsec,
      but the balancing adjustment to the ntp_error value was dropped.
      
      This causes 1ns per tick drift forward of the clock.
      
      In 3.7, this logic was isolated to only GENERIC_TIME_VSYSCALL_OLD
      architectures (s390, ia64, powerpc).
      
      The fix is simply to balance the accounting and to subtract the
      added nanosecond from ntp_error. This allows the internal long-term
      clock steering to keep the clock accurate.
      
      While this fix removes the regression added in 1e75fa8b, the
      ideal solution is to move away from GENERIC_TIME_VSYSCALL_OLD
      and use the new VSYSCALL method, which avoids entirely the
      nanosecond granular rounding, and the resulting short-term clock
      adjustment oscillation needed to keep long term accurate time.
      
      [ jstultz: Many thanks to Martin for his efforts identifying this
        	   subtle bug, and providing the fix. ]
      
      Originally-from: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Paul Turner <pjt@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable <stable@vger.kernel.org>  #v3.6+
      Link: http://lkml.kernel.org/r/1385149491-20307-1-git-send-email-john.stultz@linaro.orgSigned-off-by: NJohn Stultz <john.stultz@linaro.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      4be77398
  6. 20 11月, 2013 5 次提交
  7. 19 11月, 2013 6 次提交
  8. 16 11月, 2013 1 次提交
  9. 15 11月, 2013 10 次提交
  10. 13 11月, 2013 5 次提交