  1. 07 Aug 2010, 2 commits
    • tracing: Fix ring_buffer_read_page reading out of page boundary · 18fab912
      Committed by Huang Ying
      With CONFIG_DEBUG_PAGEALLOC=y and Shaohua's patch:

      [PATCH]x86: make spurious_fault check correct pte bit

      function graph tracing with the following commands triggers a page
      fault.
      
      # cd /sys/kernel/debug/tracing/
      # echo function_graph > current_tracer
      # cat per_cpu/cpu1/trace_pipe_raw > /dev/null
      
      BUG: unable to handle kernel paging request at ffff880006e99000
      IP: [<ffffffff81085572>] rb_event_length+0x1/0x3f
      PGD 1b19063 PUD 1b1d063 PMD 3f067 PTE 6e99160
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      last sysfs file: /sys/devices/virtual/net/lo/operstate
      CPU 1
      Modules linked in:
      
      Pid: 1982, comm: cat Not tainted 2.6.35-rc6-aes+ #300 /Bochs
      RIP: 0010:[<ffffffff81085572>]  [<ffffffff81085572>] rb_event_length+0x1/0x3f
      RSP: 0018:ffff880006475e38  EFLAGS: 00010006
      RAX: 0000000000000ff0 RBX: ffff88000786c630 RCX: 000000000000001d
      RDX: ffff880006e98000 RSI: 0000000000000ff0 RDI: ffff880006e99000
      RBP: ffff880006475eb8 R08: 000000145d7008bd R09: 0000000000000000
      R10: 0000000000008000 R11: ffffffff815d9336 R12: ffff880006d08000
      R13: ffff880006e605d8 R14: 0000000000000000 R15: 0000000000000018
      FS:  00007f2b83e456f0(0000) GS:ffff880002100000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: ffff880006e99000 CR3: 00000000064a8000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process cat (pid: 1982, threadinfo ffff880006474000, task ffff880006e40770)
      Stack:
       ffff880006475eb8 ffffffff8108730f 0000000000000ff0 000000145d7008bd
      <0> ffff880006e98010 ffff880006d08010 0000000000000296 ffff88000786c640
      <0> ffffffff81002956 0000000000000000 ffff8800071f4680 ffff8800071f4680
      Call Trace:
       [<ffffffff8108730f>] ? ring_buffer_read_page+0x15a/0x24a
       [<ffffffff81002956>] ? return_to_handler+0x15/0x2f
       [<ffffffff8108a575>] tracing_buffers_read+0xb9/0x164
       [<ffffffff810debfe>] vfs_read+0xaf/0x150
       [<ffffffff81002941>] return_to_handler+0x0/0x2f
       [<ffffffff810248b0>] __bad_area_nosemaphore+0x17e/0x1a1
       [<ffffffff81002941>] return_to_handler+0x0/0x2f
       [<ffffffff810248e6>] bad_area_nosemaphore+0x13/0x15
      Code: 80 25 b2 16 b3 00 fe c9 c3 55 48 89 e5 f0 80 0d a4 16 b3 00 02 c9 c3 55 31 c0 48 89 e5 48 83 3d 94 16 b3 00 01 c9 0f 94 c0 c3 55 <8a> 0f 48 89 e5 83 e1 1f b8 08 00 00 00 0f b6 d1 83 fa 1e 74 27
      RIP  [<ffffffff81085572>] rb_event_length+0x1/0x3f
       RSP <ffff880006475e38>
      CR2: ffff880006e99000
      ---[ end trace a6877bb92ccb36bb ]---
      
      The root cause is that ring_buffer_read_page() may read past the page
      boundary, because the boundary check is done only after the read. Fix
      this by checking the boundary before reading.
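
      A simplified sketch of the pattern (not the exact diff; "read" stands
      for the current read position and "commit" for the end of valid data):

        /* buggy order: the length is fetched first, so rb_event_length()
         * can dereference memory past the page before the bound is checked */
        size = rb_event_length(event);
        if (read + size > commit)
                break;

        /* fixed order: validate the position before touching the event */
        if (read >= commit)
                break;
        size = rb_event_length(event);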
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      LKML-Reference: <1280297641.2771.307.camel@yhuang-dev>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Fix an unallocated memory access in function_graph · 575570f0
      Committed by Shaohua Li
      With CONFIG_DEBUG_PAGEALLOC, I observed an access to unallocated memory
      in the function_graph tracer. It appears we find a small entry in the
      ring buffer but access it as a larger one. The access overflows the
      page size and touches an unallocated page.
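
      Schematically, with hypothetical entry types, the failure mode is a
      cast that trusts the expected size rather than the actual one:

        /* the event is really a small entry near the end of the page ... */
        struct small_entry *e = ring_buffer_event_data(event);

        /* ... but it is used as a larger type, so fields beyond the real
         * entry's size cross the page boundary into an unallocated page */
        struct big_entry *big = (struct big_entry *)e;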
      
      Cc: <stable@kernel.org>
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <1280217994.32400.76.camel@sli10-desk.sh.intel.com>
      [ Added a comment to explain the problem - SDR ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  2. 05 Jul 2010, 1 commit
  3. 01 Jul 2010, 2 commits
    • sched: Cure nr_iowait_cpu() users · 8c215bd3
      Committed by Peter Zijlstra
      Commit 0224cf4c (sched: Introduce get_cpu_iowait_time_us()) broke
      things by not making sure preemption was disabled in the callers of
      nr_iowait_cpu(), which reads the iowait value of the current cpu.

      This resulted in a heap of preempt warnings. Cure this by making
      nr_iowait_cpu() take a cpu number, and fix up the callers to pass in
      the right number.
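
      A minimal sketch of the new interface, assuming the usual runqueue
      accessor:

        unsigned long nr_iowait_cpu(int cpu)
        {
                /* read the iowait count of an explicit CPU instead of
                 * smp_processor_id(), so callers need not disable
                 * preemption around the call */
                return atomic_read(&cpu_rq(cpu)->nr_iowait);
        }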
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Maxim Levitsky <maximlevitsky@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: linux-pm@lists.linux-foundation.org
      LKML-Reference: <1277968037.1868.120.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • futex: futex_find_get_task remove credentials check · 7a0ea09a
      Committed by Michal Hocko
      futex_find_get_task is currently used (through lookup_pi_state) from
      two contexts, futex_requeue and futex_lock_pi_atomic.  Neither path
      appears to need the credentials check, though.  Differing (e)uids
      shouldn't matter at all, because the only thing that is important for
      a shared futex is the accessibility of the shared memory.

      The credentials check results in a glibc assert failure, or a process
      hang (if glibc is compiled without assert support), for a shared robust
      pthread mutex with priority inheritance when a process tries to lock an
      already-held lock owned by a process with a different euid:
      
      pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.
      
      The problem is that futex_lock_pi_atomic, which is called when we try
      to lock an already-held lock, checks the current holder (whose tid is
      stored in the futex value) to get the PI state.  It uses
      lookup_pi_state, which in turn gets the task struct from
      futex_find_get_task.  ESRCH is returned either when the task is not
      found or when the credentials check fails.

      futex_lock_pi_atomic simply returns if it gets ESRCH.  The glibc code,
      however, doesn't expect a robust lock to return ESRCH: it should see
      either success or "owner died".
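
      With the check gone, the lookup reduces to a plain pid-to-task lookup;
      a minimal sketch:

        static struct task_struct *futex_find_get_task(pid_t pid)
        {
                struct task_struct *p;

                rcu_read_lock();
                p = find_task_by_vpid(pid);
                if (p)
                        get_task_struct(p);     /* no euid comparison anymore */
                rcu_read_unlock();

                return p;
        }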
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 30 Jun 2010, 1 commit
  5. 25 Jun 2010, 1 commit
  6. 24 Jun 2010, 1 commit
  7. 23 Jun 2010, 1 commit
    • rcu: apply RCU protection to wake_affine() · f3b577de
      Committed by Daniel J Blueman
      The task_group() function returns a pointer that must be protected
      by either RCU, the ->alloc_lock, or the cgroup lock (see the
      rcu_dereference_check() in task_subsys_state(), which is invoked by
      task_group()).  The wake_affine() function currently does none of these,
      which means that a concurrent update would be within its rights to free
      the structure returned by task_group().  Because wake_affine() uses this
      structure only to compute load-balancing heuristics, there is no reason
      to acquire either of the two locks.
      
      Therefore, this commit introduces an RCU read-side critical section that
      starts before the first call to task_group() and ends after the last use
      of the "tg" pointer returned from task_group().  Thanks to Li Zefan for
      pointing out the need to extend the RCU read-side critical section from
      that proposed by the original patch.
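
      A minimal sketch of the shape of the fix:

        rcu_read_lock();                /* pins the structure tg points to */
        tg = task_group(p);
        /* ... load-balancing heuristics using tg ... */
        rcu_read_unlock();              /* after the last use of tg */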
      Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  8. 18 Jun 2010, 2 commits
    • sched: Fix over-scheduling bug · 3c93717c
      Committed by Alex Shi
      Commit e7097159 ("sched: Optimize unused cgroup configuration")
      introduced an imbalanced-scheduling bug.

      If CGROUP is not in use, update_h_load() won't update h_load. When the
      system has far more tasks than logical CPUs, the stale
      cfs_rq[cpu]->h_load value causes load_balance() to pull too many tasks
      to the local CPU from the busiest CPU, so the role of "busiest CPU"
      just rotates round-robin across the machine. That hurts performance.

      The issue was originally found with a scientific-computation workload
      developed by Yanmin. With that commit, the workload's performance drops
      by about 40%.
      
       CPU  before    after
      
       00   : 2       : 7
       01   : 1       : 7
       02   : 11      : 6
       03   : 12      : 7
       04   : 6       : 6
       05   : 11      : 7
       06   : 10      : 6
       07   : 12      : 7
       08   : 11      : 6
       09   : 12      : 6
       10   : 1       : 6
       11   : 1       : 6
       12   : 6       : 6
       13   : 2       : 6
       14   : 2       : 6
       15   : 1       : 6
      Reviewed-by: Yanmin Zhang <yanmin.zhang@intel.com>
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1276754893.9452.5442.camel@debian>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • nohz: Fix nohz ratelimit · 3310d4d3
      Committed by Peter Zijlstra
      Chris Wedgwood reports that 39c0cbe2 (sched: Rate-limit nohz) causes a
      serial console regression (unresponsiveness), and indeed it does. The
      reason is that the nohz code is skipped even when the tick was already
      stopped before the nohz_ratelimit(cpu) condition changed.

      Move the nohz_ratelimit() check to the other conditions that prevent
      long idle sleeps.
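
      A sketch of the idea (not the exact diff):

        /* before: bail out early, skipping the nohz code entirely,
         * even when the tick had already been stopped */
        if (nohz_ratelimit(cpu))
                goto end;

        /* after: the ratelimit only vetoes stopping a still-running tick */
        if (!ts->tick_stopped && nohz_ratelimit(cpu))
                goto end;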
      Reported-by: Chris Wedgwood <cw@f00f.org>
      Tested-by: Brian Bloniarz <bmb@athenacr.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Greg KH <gregkh@suse.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Jef Driesen <jefdriesen@telenet.be>
      LKML-Reference: <1276790557.27822.516.camel@twins>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  9. 11 Jun 2010, 1 commit
    • perf/tracing: Fix regression of perf losing kprobe events · a8fb2608
      Committed by Steven Rostedt
      With the addition of the code to shrink the kernel tracepoint
      infrastructure, we lost kprobes being traced by perf. The reason
      is that I tested if the "tp_event->class->perf_probe" existed before
      enabling it. This prevents "ftrace only" events (like the function
      trace events) from being enabled by perf.
      
      Unfortunately, kprobe events do not use perf_probe. This causes
      kprobes to be missed by perf. To fix this, we add the test to
      see if "tp_event->class->reg" exists as well as perf_probe.
      
      Normal trace events have only "perf_probe" and no "reg" function,
      while kprobes and syscalls have "reg" but no "perf_probe".
      The ftrace-only events have neither, so this is a valid test.
      If a kprobe or syscall is not to be probed by perf, the "reg"
      function is still called; it returns a failure and prevents perf
      from probing it.
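
      A sketch of the corrected test in perf's event-init path:

        /* an event is usable by perf if it has either a perf_probe
         * (normal trace events) or a reg function (kprobes, syscalls) */
        if (!tp_event->class->perf_probe && !tp_event->class->reg)
                continue;       /* ftrace-only event: not for perf */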
      Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  10. 10 Jun 2010, 1 commit
  11. 09 Jun 2010, 3 commits
    • genirq: Deal with desc->set_type() changing desc->chip · 46732475
      Committed by Thomas Gleixner
      The set_type() function can change the chip implementation when the
      trigger mode changes. That might result in using a non-initialized
      irq chip when called from __setup_irq() or when called via
      set_irq_type() on an already enabled irq.
      
      The set_irq_type() function should not be called on an enabled irq,
      but because we forgot to put a check into it, a bunch of users grew
      the habit of doing exactly that. It never blew up, as the function is
      serialized via desc->lock against all users of desc->chip, so they
      never hit the non-initialized irq chip issue.
      
      The easy fix for the __setup_irq() issue would be to move the
      irq_chip_set_defaults(desc->chip) call after the trigger setting to
      make sure that a chip change is covered.
      
      But as we already have users which do the type setting after
      request_irq(), the safe fix for now is to call irq_chip_set_defaults()
      from __irq_set_trigger() when desc->set_type() changed the irq chip.

      Whether we should refuse to change the chip on an already enabled irq
      needs deeper analysis, but that would be a large-scale change affecting
      all the existing users. So this is neither stable nor 2.6.35 material.
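
      A sketch of that fix in __irq_set_trigger(), using the names above:

        struct irq_chip *chip = desc->chip;
        int ret = chip->set_type(irq, flags);

        /* set_type() may have swapped in a different chip; give the
         * new one its default methods before anything uses it */
        if (!ret && desc->chip != chip)
                irq_chip_set_defaults(desc->chip);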
      Reported-by: Esben Haabendal <eha@doredevelopment.dk>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: linuxppc-dev <linuxppc-dev@ozlabs.org>
      Cc: stable@kernel.org
    • sched: Fix PROVE_RCU vs cpu_cgroup · dc61b1d6
      Committed by Peter Zijlstra
      PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
      typically holds rq->lock around the css rcu derefs but the generic
      cgroup code doesn't (and can't) know about that lock.
      
      Provide means to add extra checks to the css dereference and use that
      in the scheduler to annotate its users.
      
      The addition of rq->lock to these checks is correct: the
      cgroup_subsys::attach() method takes rq->lock for each task it moves,
      so by holding that lock we ensure the task is pinned to its current
      cgroup and the RCU dereference is valid.
      
      That leaves one genuine race in __sched_setscheduler() where we used
      task_group() without holding any of the required locks and thus raced
      with the cgroup code. Solve this by moving the check under the
      appropriate lock.
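
      A sketch of the annotated dereference, assuming the scheduler-side
      helper looks roughly like:

        /* holding task_rq(p)->lock pins p's cgroup, so teach the RCU
         * checker to accept it as an alternative to rcu_read_lock() */
        static inline struct task_group *task_group(struct task_struct *p)
        {
                struct cgroup_subsys_state *css;

                css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
                                lockdep_is_held(&task_rq(p)->lock));
                return container_of(css, struct task_group, css);
        }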
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf: Fix signed comparison in perf_adjust_period() · f6ab91ad
      Committed by Peter Zijlstra
      Frederic reported that frequency-driven swevents didn't work properly
      and even caused a division-by-zero error.

      It turns out there are two bugs; the division-by-zero comes from
      perf_calculate_period() failing to handle that case.
      
      The other was more interesting and turned out to be a wrong comparison
      in perf_adjust_period(). The comparison was between an s64 and a u64
      and got implicitly converted into an unsigned comparison. The problem
      is that period_left is typically < 0, so the test ended up being
      always true.

      Cure this by making the local period variables s64.
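
      A self-contained illustration of the bug and the cure:

        s64 period_left = -100;         /* typically negative */
        u64 period = 10000;
        bool wrong = period_left > period;      /* s64 vs u64: period_left is
                                                   converted to unsigned, -100
                                                   becomes huge, so this is
                                                   effectively always true */

        s64 sample_period = period;             /* fix: keep it signed */
        bool right = period_left > sample_period;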
      Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
      Tested-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 05 Jun 2010, 11 commits
    • module: fix bne2 "gave up waiting for init of module libcrc32c" · 9bea7f23
      Committed by Rusty Russell
      Problem: it's hard to avoid an init routine stumbling over a
      request_module these days.  And it's not clear it's always a bad idea:
      for example, a module like kvm with dynamic dependencies on kvm-intel
      or kvm-amd would be neater if it could simply request_module the right
      one.
      
      In this particular case, it's libcrc32c:
      
      	libcrc32c_mod_init
      	 crypto_alloc_shash
      	  crypto_alloc_tfm
      	   crypto_find_alg
      	    crypto_alg_mod_lookup
      	     crypto_larval_lookup
      	      request_module
      
      If another module is waiting inside resolve_symbol() for libcrc32c to
      finish initializing (ie. bne2 depends on libcrc32c) then it does so
      holding the module lock, and our request_module() can't make progress
      until that is released.
      
      Waiting inside resolve_symbol() without the lock isn't all that hard:
      we just need to pass the -EBUSY up the call chain so we can sleep
      where we don't hold the lock.  Error reporting is a bit trickier: we
      need to copy the name of the unfinished module before releasing the
      lock.
      
      Other notes:
      1) This also fixes a theoretical issue where a weak dependency would allow
         symbol version mismatches to be ignored.
      2) We rename use_module to ref_module to make life easier for the only
         external user (the out-of-tree ksplice patches).
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Tim Abbot <tabbott@ksplice.com>
      Tested-by: Brandon Philips <bphilips@suse.de>
    • module: verify_export_symbols under the lock · be593f4c
      Committed by Rusty Russell
      It disabled preemption, so it was "safe", but now that we no longer
      hold the lock the whole time, nothing stops another module from
      slipping in before this one is added to the global list.

      So we do this check just after we check for duplicate modules, and
      just before we put the module on the global list.

      (find_symbol finds symbols in coming and going modules, too.)
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module: move find_module check to end · 3bafeb62
      Committed by Linus Torvalds
      I think Rusty may have made the lock a bit _too_ finegrained there, and
      didn't add it to some places that needed it. It looks, for example, like
      PATCH 1/2 actually drops the lock in places where it's needed
      ("find_module()" is documented to need it, but now load_module() didn't
      hold it at all when it did the find_module()).
      
      Rather than adding a new "module_loading" list, I think we should be able
      to just use the existing "modules" list, and just fix up the locking a
      bit.
      
      In fact, maybe we could just move the "look up existing module" a bit
      later - optimistically assuming that the module doesn't exist, and then
      just undoing the work if it turns out that we were wrong, just before
      adding ourselves to the list.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module: make locking more fine-grained. · 75676500
      Committed by Rusty Russell
      Kay Sievers <kay.sievers@vrfy.org> reports that we still have some
      contention over module loading which is slowing boot.
      
      Linus also disliked a previous "drop lock and regrab" patch to fix the
      bne2 "gave up waiting for init of module libcrc32c" message.
      
      This is more ambitious: we only grab the lock where we need it.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Brandon Philips <brandon@ifup.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
    • module: Make module sysfs functions private. · 6407ebb2
      Committed by Rusty Russell
      These were placed in the header in ef665c1a to get the various
      SYSFS/MODULE config combinations to compile.

      That may have been necessary then, but it's not now.  These functions
      are all local to module.c.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
    • module: move sysfs exposure to end of load_module · 80a3d1bb
      Committed by Rusty Russell
      This means a little extra work, but it is more logical: we don't put
      anything in sysfs until we're about to put the module into the
      global list and parse its parameters.

      This also gives us a logical place to put duplicate-module detection
      in the next patch.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module: fix kdb's illicit use of struct module_use. · c8e21ced
      Committed by Rusty Russell
      Linus changed the structure, and luckily this didn't compile any more.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Martin Hicks <mort@sgi.com>
    • module: Make the 'usage' lists be two-way · 2c02dfe7
      Committed by Linus Torvalds
      When adding a module that depends on another one, we used to create a
      one-way list of "modules_which_use_me", so that module unloading could
      see who needs a module.
      
      It's actually quite simple to make that list go both ways: so that we
      not only can see "who uses me", but also see a list of modules that are
      "used by me".
      
      In fact, we always wanted that list in "module_unload_free()": when we
      unload a module, we want to also release all the other modules that are
      used by that module.  But because we didn't have that list, we used to
      first iterate over all modules, and then iterate over each "used by me"
      list of that module.
      
      By making the list two-way, we simplify module_unload_free(), and it
      allows for some trivial fixes later too.
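
      A sketch of the reworked structure (the real one lives in the module
      code):

        struct module_use {
                struct list_head source_list;   /* on the used module:
                                                   "who uses me" */
                struct list_head target_list;   /* on the using module:
                                                   "whom do I use" */
                struct module *source, *target;
        };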
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned & rebased)
    • kernel/: fix BUG_ON checks for cpu notifier callbacks direct call · 9e506f7a
      Committed by Akinobu Mita
      Commit 80b5184c ("kernel/: convert cpu notifier to return encapsulated
      errno value") changed the return value of cpu notifier callbacks.

      Those callbacks don't return NOTIFY_BAD on failure anymore.  But a few
      callbacks are called directly at init time, with their return values
      checked.

      In that commit I forgot to update the BUG_ON() checks in those direct
      callers.
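
      Schematically, a direct caller must decode the encapsulated errno
      instead of comparing against NOTIFY_BAD (my_cpu_callback is a
      hypothetical callback):

        err = my_cpu_callback(&nb, CPU_UP_PREPARE, cpu);  /* direct call */

        BUG_ON(err == NOTIFY_BAD);       /* stale: never true anymore */
        BUG_ON(notifier_to_errno(err));  /* correct: extract the errno */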
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: alloc_css_id() increments hierarchy depth · 94b3dd0f
      Committed by Greg Thelen
      Child groups should have a greater depth than their parents.  Prior to
      this change, the parent would incorrectly report zero memory usage for
      child cgroups when use_hierarchy is enabled.
      
      test script:
        mount -t cgroup none /cgroups -o memory
        cd /cgroups
        mkdir cg1
      
        echo 1 > cg1/memory.use_hierarchy
        mkdir cg1/cg11
      
        echo $$ > cg1/cg11/tasks
        dd if=/dev/zero of=/tmp/foo bs=1M count=1
      
        echo
        echo CHILD
        grep cache cg1/cg11/memory.stat
      
        echo
        echo PARENT
        grep cache cg1/memory.stat
      
        echo $$ > tasks
        rmdir cg1/cg11 cg1
        cd /
        umount /cgroups
      
      Using fae9c791, a recent patch that changed alloc_css_id() depth computation,
      the parent incorrectly reports zero usage:
        root@ubuntu:~# ./test
        1+0 records in
        1+0 records out
        1048576 bytes (1.0 MB) copied, 0.0151844 s, 69.1 MB/s
      
        CHILD
        cache 1048576
        total_cache 1048576
      
        PARENT
        cache 0
        total_cache 0
      
      With this patch, the parent correctly includes child usage:
        root@ubuntu:~# ./test
        1+0 records in
        1+0 records out
        1048576 bytes (1.0 MB) copied, 0.0136827 s, 76.6 MB/s
      
        CHILD
        cache 1052672
        total_cache 1052672
      
        PARENT
        cache 0
        total_cache 1052672
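
      The fix itself is a one-liner in spirit: a child's css ID must sit one
      level below its parent's. A sketch, assuming the names used in
      alloc_css_id():

        /* the child is one level deeper than its parent in the hierarchy */
        depth = css_depth(parent_css) + 1;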
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.34.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sys_personality: change sys_personality() to accept "unsigned int" instead of u_long · 485d5276
      Committed by Oleg Nesterov
      task_struct->personality is "unsigned int", but the sys_personality()
      paths use "unsigned long personality".  This means the assignments and
      comparisons are not quite right.  In particular, if the argument does
      not fit into "unsigned int", __set_personality() changes the caller's
      personality and then sys_personality() returns -EINVAL.

      Turn this argument into "unsigned int" and avoid the overflows.
      Obviously this is a user-visible change: we now just ignore the upper
      bits.  But it can't break a sane application.
      
      There is another thing that can confuse poorly written applications.
      User space assumes this syscall returns int, not long, which means the
      returned value can be negative and look like an error code.  But note
      that libc won't be confused and thus errno won't be set, and with this
      patch user space can never get -1 unless sys_personality() really
      fails.  And, most importantly, a negative RET != -1 is only possible
      if the app previously called personality(RET).
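
      A sketch of the corrected syscall, with the parameter width matching
      task_struct->personality:

        SYSCALL_DEFINE1(personality, unsigned int, personality)
        {
                unsigned int old = current->personality;

                if (personality != 0xffffffff)
                        set_personality(personality);

                return old;     /* truncated to int by user space, see above */
        }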
      Pointed-out-by: Wenming Zhang <wezhang@redhat.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 03 Jun 2010, 2 commits
  14. 02 Jun 2010, 1 commit
  15. 01 Jun 2010, 2 commits
  16. 31 May 2010, 8 commits
    • blktrace: Fix new kernel-doc warnings · 546cf44a
      Committed by Randy Dunlap
      Fix blktrace.c kernel-doc warnings:
       Warning(kernel/trace/blktrace.c:858): No description found for parameter 'ignore'
       Warning(kernel/trace/blktrace.c:890): No description found for parameter 'ignore'
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20100529114507.c466fc1e.randy.dunlap@oracle.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_events: Fix unincremented buffer base on partial copy · 74048f89
      Committed by Frederic Weisbecker
      If a sample crosses a page boundary, the copy is made in more than one
      step.  However, we forgot to advance the source offset for the next
      copy, leading to unexpected double copies that completely mess up the
      traces.
      
      This fixes various kinds of bad traces that have irrelevant
      data inside, as an example:
      
      	geany-4979  [001]  5758.077775: sched_switch: prev_comm=! prev_pid=121
      		prev_prio=0 prev_state=S|D|Z|X|x ==> next_comm= next_pid=7497072
      		next_prio=0
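
      A minimal sketch of the copy loop, assuming a perf-output-handle-style
      destination, with the missing advance marked:

        do {
                unsigned long size = min(len, PAGE_SIZE - offset);

                memcpy(dst, buf, size);

                len -= size;
                buf += size;    /* the missing line: advance the source, or
                                   the next step re-copies the same bytes */
                offset = 0;     /* later steps start at a fresh page */
                /* ... advance dst to the next output page ... */
        } while (len);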
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1274988898-5639-1-git-send-regression-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_events: Fix event scheduling issues introduced by transactional API · 90151c35
      Committed by Stephane Eranian
      The transactional API patch between the generic and model-specific
      code introduced several important bugs in event scheduling, at least
      on x86.  If you had pinned events, e.g., the watchdog, and were
      over-committing the PMU, you would get bogus counts.  The bug showed
      up on Intel CPUs because events move around more often than on AMD;
      the problem also existed on AMD, though it was harder to expose there.
      
      The issues were:
      
       - group_sched_in() was missing a cancel_txn() in the error path
      
       - cpuc->n_added was not properly maintained, leading to missing
         actions in hw_perf_enable(), i.e., n_running being 0.  You cannot
         update n_added until you know the transaction has succeeded; in
         the case of a failed transaction, n_added was not adjusted back.

       - in case of failed transactions, event_sched_out() was called and
         eventually invoked x86_disable_event() to touch the HW reg.  But
         with transactions, on x86, event_sched_in() does not touch HW
         registers; it simply collects events into a list.  Thus, you could
         end up calling x86_disable_event() on a counter which did not
         correspond to the current event when idx != -1.
      
      The patch modifies the generic and X86 code to avoid all those problems.
      
      First, we keep track of the number of events added last.  In case the
      transaction fails, we subtract them from n_added.  This approach is
      necessary (as opposed to delaying updates to n_added) because not all
      event updates use the transaction API; e.g., single events don't.
      
      Second, we encapsulate the event_sched_in() and event_sched_out() in
      group_sched_in() inside the transaction. That makes the operations
      symmetrical and you can also detect that you are inside a transaction
      and skip the HW reg access by checking cpuc->group_flag.
      
      With this patch, you can now overcommit the PMU even with pinned
      system-wide events present and still get valid counts.
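
      A sketch of the group_sched_in() shape after the fix, error-path
      details elided, showing the previously missing cancel:

        pmu->start_txn(pmu);

        if (event_sched_in(group_leader, cpuctx, ctx))
                goto txn_fail;

        list_for_each_entry(event, &group_leader->sibling_list, group_entry) {
                if (event_sched_in(event, cpuctx, ctx))
                        goto group_error;
        }

        if (!pmu->commit_txn(pmu))
                return 0;               /* the whole group went in atomically */

        group_error:
        /* ... unschedule the siblings that made it in, then the leader ... */
        txn_fail:
        pmu->cancel_txn(pmu);           /* the cancel missing in the error path */
        return -EAGAIN;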
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1274796225.5882.1389.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_events, trace: Fix perf_trace_destroy(), mutex went missing · 2e97942f
      Committed by Peter Zijlstra
      Steve spotted that I forgot to do the destroy under event_mutex.
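
      The fix is simply to take the mutex around the teardown; schematically:

        void perf_trace_destroy(struct perf_event *p_event)
        {
                mutex_lock(&event_mutex);
                /* ... unregister and free the per-event trace data ... */
                mutex_unlock(&event_mutex);
        }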
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1274451913.1674.1707.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_events, trace: Fix probe unregister race · 3771f077
      Committed by Peter Zijlstra
      tracepoint_probe_unregister() does not synchronize against the probe
      callbacks, so do that explicitly. This properly serializes the callbacks
      and the free of the data used therein.
      
      Also, use this_cpu_ptr() where possible.
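
      Schematically, the unregistration must wait for in-flight probes before
      their data is freed (free_probe_data is a hypothetical helper):

        tracepoint_probe_unregister(name, probe, data);
        tracepoint_synchronize_unregister();    /* wait until no callback
                                                   can still be running */
        free_probe_data(data);                  /* now safe to free */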
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1274438476.1674.1702.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_events: Fix races in group composition · 8a49542c
      Committed by Peter Zijlstra
      Group siblings don't pin each other or the parent, so when we destroy
      events we must make sure to clean up all cross-referencing pointers.
      
      In particular, for destruction of a group leader we must be able to
      find all its siblings and remove their reference to it.
      
      This means that detaching an event from its context must not detach it
      from the group, otherwise we can end up failing to clear all pointers.
      
      Solve this by clearly separating the attachment to a context and
      attachment to a group, and keep the group composed until we destroy
      the events.
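
      A sketch of the separation, using attach-state flags so teardown can
      tell the two apart:

        #define PERF_ATTACH_CONTEXT     0x01    /* on a context's list */
        #define PERF_ATTACH_GROUP       0x02    /* on a group's list */

        /* detaching from the context clears only the context bit; the
         * group stays composed until the events are truly destroyed */
        event->attach_state &= ~PERF_ATTACH_CONTEXT;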
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_events: Fix races and clean up perf_event and perf_mmap_data interaction · ac9721f3
      Committed by Peter Zijlstra
      In order to move toward separate buffer objects, rework the whole
      perf_mmap_data construct to be a more self-sufficient entity, one
      with its own lifetime rules.
      
      This greatly sanitizes the whole output redirection code, which
      was riddled with bugs and races.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Make sure timers have migrated before killing the migration_thread · 54e88fad
      Committed by Amit K. Arora
      Problem: In a stress test where some heavy tests were running along with
      regular CPU offlining and onlining, a hang was observed. The system seems
      to be hung at a point where migration_call() tries to kill the
      migration_thread of the dying CPU, which just got moved to the current
      CPU. This migration thread does not get a chance to run (and die) since
      rt_throttled is set to 1 on current, and it doesn't get cleared as the
      hrtimer which is supposed to reset the rt bandwidth
      (sched_rt_period_timer) is tied to the CPU which we just marked dead!
      
      Solution: This patch pushes the killing of migration thread to
      "CPU_POST_DEAD" event. By then all the timers (including
      sched_rt_period_timer) should have got migrated (along with other
      callbacks).
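
      Schematically, in migration_call():

        case CPU_POST_DEAD:
                /* by now the dead CPU's timers, including
                 * sched_rt_period_timer, have been migrated away,
                 * so the thread can actually run and exit */
                kthread_stop(cpu_rq(cpu)->migration_thread);
                break;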
      Signed-off-by: Amit Arora <aarora@in.ibm.com>
      Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <20100525132346.GA14986@amitarora.in.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>