1. 16 Nov, 2014 1 commit
  2. 10 Nov, 2014 1 commit
    • sched/numa: Fix out of bounds read in sched_init_numa() · c123588b
      Committed by Andrey Ryabinin
      On latest mm + KASan patchset I've got this:
      
          ==================================================================
          BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
          =============================================================================
          BUG kmalloc-8 (Not tainted): kasan error
          -----------------------------------------------------------------------------
      
          Disabling lock debugging due to kernel taint
          INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
           __slab_alloc+0x4b4/0x4f0
           __kmalloc_track_caller+0x15f/0x1e0
           kstrdup+0x44/0x90
           alloc_vfsmnt+0xb0/0x2c0
           vfs_kern_mount+0x35/0x190
           kern_mount_data+0x25/0x50
           pid_ns_prepare_proc+0x19/0x50
           alloc_pid+0x5e2/0x630
           copy_process.part.41+0xdf5/0x2aa0
           do_fork+0xf5/0x460
           kernel_thread+0x21/0x30
           rest_init+0x1e/0x90
           start_kernel+0x522/0x531
           x86_64_start_reservations+0x2a/0x2c
           x86_64_start_kernel+0x15b/0x16a
          INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
          INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70
      
          Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
          Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5                          proc.kk.
          Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc                          ........
          Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
          CPU: 0 PID: 1 Comm: swapper/0 Tainted: G    B          3.18.0-rc3-mm1+ #108
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
           ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
           ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
           ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
          Call Trace:
          dump_stack (lib/dump_stack.c:52)
          print_trailer (mm/slub.c:645)
          object_err (mm/slub.c:652)
          ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
          kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
          ? kasan_poison_shadow (mm/kasan/kasan.c:48)
          ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
          ? kasan_poison_shadow (mm/kasan/kasan.c:48)
          ? kasan_kmalloc (mm/kasan/kasan.c:311)
          __asan_load4 (mm/kasan/kasan.c:371)
          ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
          sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
          kernel_init_freeable (init/main.c:869 init/main.c:997)
          ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
          ? rest_init (init/main.c:924)
          kernel_init (init/main.c:929)
          ? rest_init (init/main.c:924)
          ret_from_fork (arch/x86/kernel/entry_64.S:348)
          ? rest_init (init/main.c:924)
          Read of size 4 by task swapper/0:
          Memory state around the buggy address:
           ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
           ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
          >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
                                                                    ^
           ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
           ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
           ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
          ==================================================================
      
      A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
      access in this line:
      
           sched_max_numa_distance = sched_domains_numa_distance[level - 1];
      
      Fix this by exiting from sched_init_numa() earlier.
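
      A minimal sketch of such an early exit (the names come from the message
      above; the exact placement inside sched_init_numa() may differ):

           /* sketch: bail out before the distance table is indexed */
           if (!level)
                   return;

           sched_max_numa_distance = sched_domains_numa_distance[level - 1];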
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Fixes: 9942f79b ("sched/numa: Export info needed for NUMA balancing on complex topologies")
      Cc: peterz@infradead.org
      Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 04 Nov, 2014 1 commit
  4. 28 Oct, 2014 8 commits
    • sched/dl: Fix preemption checks · f3a7e1a9
      Committed by Kirill Tkhai
      1) The switched_to_dl() check is wrong. We reschedule only
         if rq->curr is a deadline task, and we do not reschedule
         if it's a lower-priority task. But we must always
         preempt a task of another class.
      
      2) dl_task_timer():
         Policy does not change in case of priority inheritance.
         rt_mutex_setprio() changes prio, while policy remains old.
      
      So we lose some balancing logic in dl_task_timer() and
      switched_to_dl() when we check the policy instead of the priority. A
      boosted task may be rq->curr.
      
      (I didn't change switched_from_dl() because no check is necessary
      there at all).
      
      I've looked at this place (switched_to_dl()) several times and even fixed
      this function before, but only found this now...  I suppose some performance
      tests may work better after this.
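
      The distinction matters because rt_mutex_setprio() changes p->prio while
      p->policy stays the same, so the check has to look at the effective
      priority. An illustrative sketch of the intended behaviour (dl_task()
      tests the boosted priority, dl_policy() only the static policy; the real
      call sites are switched_to_dl() and dl_task_timer()):

           /* sketch only */
           if (dl_task(rq->curr))
                   check_preempt_curr_dl(rq, p, 0); /* curr is -deadline: compare deadlines */
           else
                   resched_curr(rq); /* curr belongs to a lower class: always preempt */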
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: stop the unbound recursion in preempt_schedule_context() · 009f60e2
      Committed by Oleg Nesterov
      preempt_schedule_context() does preempt_enable_notrace() at the end
      and this can call the same function again; exception_exit() is heavy
      and it is quite possible that need-resched is true again.
      
      1. Change this code to decrement preempt_count() and check need_resched()
         by hand.
      
      2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
         the enable/disable dance around __schedule(). But in this case
         we need to move it into sched/core.c.
      
      3. Cosmetic, but x86 forgets to declare this function. This doesn't
         really matter because it is only called by asm helpers; still, it
         makes sense to add the declaration into asm/preempt.h to match
         preempt_schedule().
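
      A sketch of the resulting shape, per points 1 and 2 above (PREEMPT_ACTIVE
      around __schedule(), and the need-resched recheck done by hand instead of
      recursing through preempt_enable_notrace()):

           enum ctx_state prev_ctx;

           do {
                   __preempt_count_add(PREEMPT_ACTIVE);
                   prev_ctx = exception_enter();
                   __schedule();
                   exception_exit(prev_ctx);
                   __preempt_count_sub(PREEMPT_ACTIVE);
                   barrier();
           } while (need_resched());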
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix division by zero sysctl_numa_balancing_scan_size · 64192658
      Committed by Kirill Tkhai
      The file /proc/sys/kernel/numa_balancing_scan_size_mb allows writing zero.
      
      This bash command reproduces the problem:
      
      $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
      	   echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done
      
      	divide error: 0000 [#1] SMP
      	Modules linked in:
      	CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      	task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
      	RIP: 0010:[<ffffffff81074191>]  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP: 0000:ffff880037a6bce0  EFLAGS: 00010246
      	RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
      	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
      	RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
      	R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
      	R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
      	FS:  00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      	CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
      	Stack:
      	 ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
      	 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
      	 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
      	Call Trace:
      	 [<ffffffff810741d1>] task_scan_max+0x11/0x40
      	 [<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
      	 [<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
      	 [<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
      	 [<ffffffff8103e2f1>] __do_page_fault+0x191/0x510
      	 [<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
      	 [<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
      	 [<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
      	 [<ffffffff8104887d>] ? do_fork+0x14d/0x340
      	 [<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
      	 [<ffffffff811799df>] ? __fd_install+0x1f/0x60
      	 [<ffffffff8103e67c>] do_page_fault+0xc/0x10
      	 [<ffffffff8150d322>] page_fault+0x22/0x30
      	RIP  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP <ffff880037a6bce0>
      	---[ end trace 9a826d16936c04de ]---
      
      Also fix a race in task_scan_min() (it depends on compiler behaviour).
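
      A sketch of the two parts such a fix presumably needs (assumed shape, not
      necessarily the exact patch): reject 0 at the sysctl level, and snapshot
      the value in task_scan_min() so the divisor cannot change between the
      check and the division:

           /* sysctl table entry: proc_dointvec_minmax with a floor of 1 */
           {
                   .procname     = "numa_balancing_scan_size_mb",
                   .data         = &sysctl_numa_balancing_scan_size,
                   .maxlen       = sizeof(unsigned int),
                   .mode         = 0644,
                   .proc_handler = proc_dointvec_minmax,
                   .extra1       = &one,   /* minimum of 1, i.e. disallow writing 0 */
           },

           /* task_scan_min(): read the sysctl only once before dividing by it */
           unsigned int scan_size = ACCESS_ONCE(sysctl_numa_balancing_scan_size);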
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Care divide error in update_task_scan_period() · 2847c90e
      Committed by Yasuaki Ishimatsu
      While offlining a node by hot-removing memory, the following divide error
      occurs:
      
        divide error: 0000 [#1] SMP
        [...]
        Call Trace:
         [...] handle_mm_fault
         [...] ? try_to_wake_up
         [...] ? wake_up_state
         [...] __do_page_fault
         [...] ? do_futex
         [...] ? put_prev_entity
         [...] ? __switch_to
         [...] do_page_fault
         [...] page_fault
        [...]
        RIP  [<ffffffff810a7081>] task_numa_fault
         RSP <ffff88084eb2bcb0>
      
      The issue occurs as follows:
        1. When page fault occurs and page is allocated from node 1,
           task_struct->numa_faults_buffer_memory[] of node 1 is
           incremented and p->numa_faults_locality[] is also incremented
           as follows:
      
           o numa_faults_buffer_memory[]       o numa_faults_locality[]
                    NR_NUMA_HINT_FAULT_TYPES
                   |      0     |     1     |
           ----------------------------------  ----------------------
            node 0 |      0     |     0     |   remote |      0     |
       node 1 |      0     |     1     |   local  |      1     |
           ----------------------------------  ----------------------
      
        2. node 1 is offlined by hot removing memory.
      
        3. When a page fault occurs, fault_types[] is calculated using
           p->numa_faults_buffer_memory[] of all online nodes in
           task_numa_placement(). But node 1 was offlined in step 2, so
           fault_types[] is calculated using only
           p->numa_faults_buffer_memory[] of node 0, and both entries of
           fault_types[] end up as 0.
      
        4. The values (0) of fault_types[] are passed to update_task_scan_period().
      
        5. numa_faults_locality[1] is set to 1. So the following division is
           calculated.
      
              static void update_task_scan_period(struct task_struct *p,
                                      unsigned long shared, unsigned long private){
              ...
                      ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
              }
      
        6. But both private and shared are 0, so a divide error
           occurs here.
      
      The divide error is a rare case because the trigger is node offlining.
      This patch always increments the denominator to avoid the divide error.
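
      Given point 6, the guarded division presumably becomes (a sketch based on
      the snippet above):

           /* private and shared can both be 0 after a node was offlined,
            * so keep the denominator strictly positive */
           ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));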
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Fix unsafe get_task_struct() in task_numa_assign() · 1effd9f1
      Committed by Kirill Tkhai
      The unlocked access to dst_rq->curr in task_numa_compare() is racy.
      If the curr task is exiting, this may cause a use-after-free:
      
      task_numa_compare()                    do_exit()
          ...                                        current->flags |= PF_EXITING;
          ...                                    release_task()
          ...                                        ~~delayed_put_task_struct()~~
          ...                                    schedule()
          rcu_read_lock()                        ...
          cur = ACCESS_ONCE(dst_rq->curr)        ...
              ...                                rq->curr = next;
              ...                                    context_switch()
              ...                                        finish_task_switch()
              ...                                            put_task_struct()
              ...                                                __put_task_struct()
              ...                                                    free_task_struct()
              task_numa_assign()                                     ...
                  get_task_struct()                                  ...
      
      As noted by Oleg:
      
        <<The lockless get_task_struct(tsk) is only safe if tsk == current
          and didn't pass exit_notify(), or if this tsk was found on a rcu
          protected list (say, for_each_process() or find_task_by_vpid()).
          IOW, it is only safe if release_task() was not called before we
          take rcu_read_lock(), in this case we can rely on the fact that
          delayed_put_pid() can not drop the (potentially) last reference
          until rcu_read_unlock().
      
          And as Kirill pointed out task_numa_compare()->task_numa_assign()
          path does get_task_struct(dst_rq->curr) and this is not safe. The
          task_struct itself can't go away, but rcu_read_lock() can't save
          us from the final put_task_struct() in finish_task_switch(); this
          reference goes away without rcu gp>>
      
      The patch provides simple check of PF_EXITING flag. If it's not set,
      this guarantees that call_rcu() of delayed_put_task_struct() callback
      hasn't happened yet, so we can safely do get_task_struct() in
      task_numa_assign().
      
      Holding dst_rq->lock protects against concurrency with the last schedule();
      without it, cur's memory may be reused or unmapped.
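
      A sketch of the check described above (illustrative; the exact condition
      in task_numa_compare() may also cover other cases):

           rcu_read_lock();
           cur = ACCESS_ONCE(dst_rq->curr);
           /*
            * If cur is exiting, release_task() may already have run and the
            * final put_task_struct() is not deferred by an RCU grace period,
            * so do not take a reference on it.
            */
           if (cur && (cur->flags & PF_EXITING))
                   cur = NULL;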
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer() · aee38ea9
      Committed by Juri Lelli
      dl_task_timer() is racy against several paths. Daniel noticed that
      the replenishment timer may experience a race condition against an
      enqueue_dl_entity() called from rt_mutex_setprio(). With his own
      words:
      
       rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
       start_dl_timer() throttled = 1, rt_mutex_setprio() throttled = 0,
       sched_switch() -> enqueue_task(), dl_task_timer -> enqueue_task(),
       throttled is 0
      
      => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
      enqueued on the -deadline runqueue.
      
      As we do for the other races, we just bail out in the replenishment
      timer code.
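
      A sketch of the kind of bail-out described (assumed condition; the real
      check in dl_task_timer() may also cover the other races mentioned):

           /*
            * If a boost/deboost path already cleared dl_throttled and enqueued
            * the entity, there is nothing left for the replenishment timer to do.
            */
           if (!dl_se->dl_throttled)
                   goto unlock;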
      Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: vincent@legout.info
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Michael Trimarchi <michael@amarulasolutions.com>
      Cc: Fabio Checconi <fchecconi@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Don't replenish from a !SCHED_DEADLINE entity · 64be6f1f
      Committed by Juri Lelli
      In the deboost path, right after the dl_boosted flag has been
      reset, we can currently end up replenishing using -deadline
      parameters of a !SCHED_DEADLINE entity. This of course causes
      a bug, as those parameters are empty.
      
      In the case depicted above it is safe to simply bail out, as
      the deboosted task is going to go back to its original scheduling
      class anyway.
      Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: vincent@legout.info
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Michael Trimarchi <michael@amarulasolutions.com>
      Cc: Fabio Checconi <fchecconi@gmail.com>
      Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Fix race between task_group and sched_task_group · eeb61e53
      Committed by Kirill Tkhai
      The race may happen when somebody is changing the task_group of a forking task.
      The child's cgroup is the same as the parent's after dup_task_struct() (it is
      just a memory copy). Also, cfs_rq and rt_rq are the same as the parent's.
      
      But if the parent changes its task_group before cgroup_post_fork() is called,
      this is not reflected in the child. The child's cfs_rq and rt_rq remain
      the same, while the child's task_group changes in cgroup_post_fork().
      
      To fix this we introduce a fork() method, which calls sched_move_task() directly.
      This function changes sched_task_group to the appropriate one (its logic also has
      no problem with freshly created tasks, so we don't need to introduce anything
      special; we can simply use it).
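
      The fork() method described is presumably just a thin cpu-cgroup callback
      along these lines (sketch):

           static void cpu_cgroup_fork(struct task_struct *task)
           {
                   sched_move_task(task);
           }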
      
      Possibly, this resolves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 10 Oct, 2014 1 commit
    • mempolicy: remove the "task" arg of vma_policy_mof() and simplify it · 6b6482bb
      Committed by Oleg Nesterov
      1. vma_policy_mof(task) is simply not safe unless task == current;
         it can race with do_exit()->mpol_put(). Remove this arg and update
         its single caller.
      
      2. vma cannot be NULL; remove this check and simplify the code.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 03 Oct, 2014 4 commits
    • sched/dl: Use dl_bw_of() under rcu_read_lock_sched() · f10e00f4
      Committed by Kirill Tkhai
      rq->rd is freed using call_rcu_sched(), so rcu_read_lock() to access it
      is not enough. We should use either rcu_read_lock_sched() or preempt_disable().
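
      The resulting access pattern is along these lines (sketch; dl_bw_of()
      dereferences rq->rd, which is freed via call_rcu_sched()):

           rcu_read_lock_sched();   /* matches call_rcu_sched(); plain rcu_read_lock() is not enough */
           dl_b = dl_bw_of(cpu);
           /* ... read/update the deadline bandwidth accounting ... */
           rcu_read_unlock_sched();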
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Fixes: 66339c31 "sched: Use dl_bw_of() under RCU read lock"
      Link: http://lkml.kernel.org/r/1412065417.20287.24.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Delete resched_cpu() from idle_balance() · 10a12983
      Committed by Kirill Tkhai
      We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr()
      if this is necessary.
      
      Furthermore, a task of a higher-priority class may be current on the dest rq;
      we shouldn't disturb it.
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched, time: Fix build error with 64 bit cputime_t on 32 bit systems · 347abad9
      Committed by Rik van Riel
      On 32 bit systems cmpxchg cannot handle 64 bit values, so
      some additional magic is required to allow a 32 bit system
      with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y enabled to build.
      
      Make sure the correct cmpxchg function is used when doing
      an atomic swap of a cputime_t.
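
      One way to express this (a sketch of the idea, not necessarily the exact
      patch) is a small wrapper that picks the cmpxchg variant based on the
      width of cputime_t:

           #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
           /* cputime_t is a nanosecond-based u64 here: 32 bit needs cmpxchg64() */
           # define cmpxchg_cputime(ptr, old, new)  cmpxchg64(ptr, old, new)
           #else
           # define cmpxchg_cputime(ptr, old, new)  cmpxchg(ptr, old, new)
           #endif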
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: umgwanakikbuti@gmail.com
      Cc: fweisbec@gmail.com
      Cc: srao@redhat.com
      Cc: lwoodman@redhat.com
      Cc: atheurer@redhat.com
      Cc: oleg@redhat.com
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linux390@de.ibm.com
      Cc: linux-arch@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Improve sysbench performance by fixing spurious active migration · 43f4d666
      Committed by Vincent Guittot
      Since commit caeb178c ("sched/fair: Make update_sd_pick_busiest() ...")
      sd_pick_busiest can return a group that is neither imbalanced nor overloaded
      but merely more loaded than others. This change was introduced to ensure
      a better load balance in systems that are not overloaded, but as a side effect
      it can also generate useless active migrations between groups.
      
      Let's take the example of 3 tasks on a quad-core system. We will always have an
      idle core, so the load balancer will find a busiest group (core) whenever an ILB
      is triggered, and it will force an active migration (once above the
      nr_balance_failed threshold) so that the idle core becomes busy but another core
      becomes idle. With the next ILB, the freshly idle core will try to pull a task
      from a busy CPU.
      The number of spurious active migrations is not so large on a quad-core system
      because the ILB is not triggered that often. But it becomes significant as soon as
      you have more than one sched_domain level, as on a dual cluster of quad cores,
      where the ILB is triggered every tick when you have more than 1 busy CPU.
      
      We need to ensure that the migration generates a real improvement and does not
      just move the avg_load imbalance onto another CPU.
      
      Before caeb178c, such cases were filtered out
      by the following test in f_b_g:
      
        if ((local->idle_cpus < busiest->idle_cpus) &&
      		    busiest->sum_nr_running  <= busiest->group_weight)
      
      This patch modifies the condition to take into account the situation where the
      busiest group is not overloaded: if the difference between the number of idle
      CPUs in the 2 groups is less than or equal to 1 and the busiest group is not
      overloaded, moving a task will not improve the load balance but will just move it.
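
      In code, the new check in f_b_g presumably reads something like this (a
      sketch derived from the wording above, not the verbatim patch):

           if (env->idle == CPU_IDLE) {
                   /*
                    * Busiest group is not overloaded and the idle-CPU diff is at
                    * most 1: migrating would only move the imbalance elsewhere.
                    */
                   if (busiest->group_type != group_overloaded &&
                       local->idle_cpus <= (busiest->idle_cpus + 1))
                           goto out_balanced;
           }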
      
      A test with sysbench on a dual clusters of quad cores gives the following
      results:
      
        command: sysbench --test=cpu --num-threads=5 --max-time=5 run
      
      The HZ is 200, which means that 1000 ticks have fired during the test.
      
      With Mainline, perf gives the following figures:
      
       Samples: 727  of event 'sched:sched_migrate_task'
       Event count (approx.): 727
        Overhead  Command          Shared Object  Symbol
        ........  ...............  .............  ..............
          12.52%  migration/1      [unknown]      [.] 00000000
          12.52%  migration/5      [unknown]      [.] 00000000
          12.52%  migration/7      [unknown]      [.] 00000000
          12.10%  migration/6      [unknown]      [.] 00000000
          11.83%  migration/0      [unknown]      [.] 00000000
          11.83%  migration/3      [unknown]      [.] 00000000
          11.14%  migration/4      [unknown]      [.] 00000000
          10.87%  migration/2      [unknown]      [.] 00000000
           2.75%  sysbench         [unknown]      [.] 00000000
           0.83%  swapper          [unknown]      [.] 00000000
           0.55%  ktps65090charge  [unknown]      [.] 00000000
           0.41%  mmcqd/1          [unknown]      [.] 00000000
           0.14%  perf             [unknown]      [.] 00000000
      
      With this patch, perf gives the following figures:
      
       Samples: 20  of event 'sched:sched_migrate_task'
       Event count (approx.): 20
        Overhead  Command          Shared Object  Symbol
        ........  ...............  .............  ..............
          80.00%  sysbench         [unknown]      [.] 00000000
          10.00%  swapper          [unknown]      [.] 00000000
           5.00%  ktps65090charge  [unknown]      [.] 00000000
           5.00%  migration/1      [unknown]      [.] 00000000
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 25 Sep, 2014 1 commit
    • SCHED: add some "wait..on_bit...timeout()" interfaces. · cbbce822
      Committed by NeilBrown
      In commit c1221321
         sched: Allow wait_on_bit_action() functions to support a timeout
      
      I suggested that a "wait_on_bit_timeout()" interface would not meet my
      need.  This isn't true - I was just over-engineering.
      
      Including a 'private' field in wait_bit_key instead of a focused
      "timeout" field was just premature generalization.  If some other
      use is ever found, it can be generalized or added later.
      
      So this patch renames "private" to "timeout", with the meaning "stop
      waiting when jiffies reaches or passes timeout",
      and adds two of the many possible wait..bit..timeout() interfaces:
      
      wait_on_page_bit_killable_timeout(), which is the one I want to use,
      and out_of_line_wait_on_bit_timeout() which is a reasonably general
      example.  Others can be added as needed.
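
      A sketch of what the timeout semantics implies for a bit-wait action
      function (illustrative only; the name and exact return codes are
      assumptions, not taken from the patch):

           __sched int bit_wait_timeout(struct wait_bit_key *key)
           {
                   unsigned long now = ACCESS_ONCE(jiffies);

                   if (time_after_eq(now, key->timeout))
                           return -EAGAIN;         /* deadline passed: stop waiting */
                   schedule_timeout(key->timeout - now);
                   return 0;
           }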
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
  8. 24 Sep, 2014 14 commits
  9. 21 Sep, 2014 1 commit
  10. 19 Sep, 2014 8 commits