1. 10 April 2018 (1 commit)
    • perf/core: Fix use-after-free in uprobe_perf_close() · 621b6d2e
      Prashant Bhole authored
      A use-after-free bug was caught by KASAN while running usdt-related
      code (BCC project, bcc/tests/python/test_usdt2.py):
      
      	==================================================================
      	BUG: KASAN: use-after-free in uprobe_perf_close+0x222/0x3b0
      	Read of size 4 at addr ffff880384f9b4a4 by task test_usdt2.py/870
      
      	CPU: 4 PID: 870 Comm: test_usdt2.py Tainted: G        W         4.16.0-next-20180409 #215
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      	Call Trace:
      	 dump_stack+0xc7/0x15b
      	 ? show_regs_print_info+0x5/0x5
      	 ? printk+0x9c/0xc3
      	 ? kmsg_dump_rewind_nolock+0x6e/0x6e
      	 ? uprobe_perf_close+0x222/0x3b0
      	 print_address_description+0x83/0x3a0
      	 ? uprobe_perf_close+0x222/0x3b0
      	 kasan_report+0x1dd/0x460
      	 ? uprobe_perf_close+0x222/0x3b0
      	 uprobe_perf_close+0x222/0x3b0
      	 ? probes_open+0x180/0x180
      	 ? free_filters_list+0x290/0x290
      	 trace_uprobe_register+0x1bb/0x500
      	 ? perf_event_attach_bpf_prog+0x310/0x310
      	 ? probe_event_disable+0x4e0/0x4e0
      	 perf_uprobe_destroy+0x63/0xd0
      	 _free_event+0x2bc/0xbd0
      	 ? lockdep_rcu_suspicious+0x100/0x100
      	 ? ring_buffer_attach+0x550/0x550
      	 ? kvm_sched_clock_read+0x1a/0x30
      	 ? perf_event_release_kernel+0x3e4/0xc00
      	 ? __mutex_unlock_slowpath+0x12e/0x540
      	 ? wait_for_completion+0x430/0x430
      	 ? lock_downgrade+0x3c0/0x3c0
      	 ? lock_release+0x980/0x980
      	 ? do_raw_spin_trylock+0x118/0x150
      	 ? do_raw_spin_unlock+0x121/0x210
      	 ? do_raw_spin_trylock+0x150/0x150
      	 perf_event_release_kernel+0x5d4/0xc00
      	 ? put_event+0x30/0x30
      	 ? fsnotify+0xd2d/0xea0
      	 ? sched_clock_cpu+0x18/0x1a0
      	 ? __fsnotify_update_child_dentry_flags.part.0+0x1b0/0x1b0
      	 ? pvclock_clocksource_read+0x152/0x2b0
      	 ? pvclock_read_flags+0x80/0x80
      	 ? kvm_sched_clock_read+0x1a/0x30
      	 ? sched_clock_cpu+0x18/0x1a0
      	 ? pvclock_clocksource_read+0x152/0x2b0
      	 ? locks_remove_file+0xec/0x470
      	 ? pvclock_read_flags+0x80/0x80
      	 ? fcntl_setlk+0x880/0x880
      	 ? ima_file_free+0x8d/0x390
      	 ? lockdep_rcu_suspicious+0x100/0x100
      	 ? ima_file_check+0x110/0x110
      	 ? fsnotify+0xea0/0xea0
      	 ? kvm_sched_clock_read+0x1a/0x30
      	 ? rcu_note_context_switch+0x600/0x600
      	 perf_release+0x21/0x40
      	 __fput+0x264/0x620
      	 ? fput+0xf0/0xf0
      	 ? do_raw_spin_unlock+0x121/0x210
      	 ? do_raw_spin_trylock+0x150/0x150
      	 ? SyS_fchdir+0x100/0x100
      	 ? fsnotify+0xea0/0xea0
      	 task_work_run+0x14b/0x1e0
      	 ? task_work_cancel+0x1c0/0x1c0
      	 ? copy_fd_bitmaps+0x150/0x150
      	 ? vfs_read+0xe5/0x260
      	 exit_to_usermode_loop+0x17b/0x1b0
      	 ? trace_event_raw_event_sys_exit+0x1a0/0x1a0
      	 do_syscall_64+0x3f6/0x490
      	 ? syscall_return_slowpath+0x2c0/0x2c0
      	 ? lockdep_sys_exit+0x1f/0xaa
      	 ? syscall_return_slowpath+0x1a3/0x2c0
      	 ? lockdep_sys_exit+0x1f/0xaa
      	 ? prepare_exit_to_usermode+0x11c/0x1e0
      	 ? enter_from_user_mode+0x30/0x30
      	random: crng init done
      	 ? __put_user_4+0x1c/0x30
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      	RIP: 0033:0x7f41d95f9340
      	RSP: 002b:00007fffe71e4268 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
      	RAX: 0000000000000000 RBX: 000000000000000d RCX: 00007f41d95f9340
      	RDX: 0000000000000000 RSI: 0000000000002401 RDI: 000000000000000d
      	RBP: 0000000000000000 R08: 00007f41ca8ff700 R09: 00007f41d996dd1f
      	R10: 00007fffe71e41e0 R11: 0000000000000246 R12: 00007fffe71e4330
      	R13: 0000000000000000 R14: fffffffffffffffc R15: 00007fffe71e4290
      
      	Allocated by task 870:
      	 kasan_kmalloc+0xa0/0xd0
      	 kmem_cache_alloc_node+0x11a/0x430
      	 copy_process.part.19+0x11a0/0x41c0
      	 _do_fork+0x1be/0xa20
      	 do_syscall_64+0x198/0x490
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      	Freed by task 0:
      	 __kasan_slab_free+0x12e/0x180
      	 kmem_cache_free+0x102/0x4d0
      	 free_task+0xfe/0x160
      	 __put_task_struct+0x189/0x290
      	 delayed_put_task_struct+0x119/0x250
      	 rcu_process_callbacks+0xa6c/0x1b60
      	 __do_softirq+0x238/0x7ae
      
      	The buggy address belongs to the object at ffff880384f9b480
      	 which belongs to the cache task_struct of size 12928
      
      This happens because the task_struct is freed before the perf_event
      that refers to it, and the task's flags are checked during teardown of
      the event. perf_event_alloc() assigns the task_struct to the
      perf_event's hw.target, but takes no reference on it.
      
      As a fix, we call get_task_struct() in perf_event_alloc() at the
      above-mentioned assignment, and put_task_struct() in _free_event().
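      
      A minimal sketch of that pairing, assuming the surrounding context of
      kernel/events/core.c (illustrative, not the verbatim patch):
      
      	/* perf_event_alloc(): take a reference while recording the target */
      	if (task) {
      		event->attach_state = PERF_ATTACH_TASK;
      		event->hw.target = get_task_struct(task);
      	}
      
      	/* _free_event(): drop the reference during teardown */
      	if (event->hw.target)
      		put_task_struct(event->hw.target);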
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 63b6da39 ("perf: Fix perf_event_exit_task() race")
      Link: http://lkml.kernel.org/r/20180409100346.6416-1-bhole_prashant_q7@lab.ntt.co.jp
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 03 April 2018 (17 commits)
  3. 31 March 2018 (1 commit)
  4. 29 March 2018 (3 commits)
  5. 28 March 2018 (1 commit)
  6. 27 March 2018 (1 commit)
  7. 24 March 2018 (2 commits)
  8. 22 March 2018 (1 commit)
  9. 21 March 2018 (2 commits)
  10. 20 March 2018 (11 commits)
    • sched/debug: Adjust newlines for better alignment · e9ca2670
      Joe Lawrence authored
      Scheduler debug stats include newlines that display out of alignment
      when prefixed by timestamps.  For example, in the dmesg utility:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        [   83.124251]
        runnable tasks:
         S           task   PID         tree-key  switches  prio     wait-time
        sum-exec        sum-sleep
        -----------------------------------------------------------------------------------------------------------
      
      At the same time, some syslog utilities (like rsyslog by default) don't
      like the additional newlines control characters, saving lines like this
      to /var/log/messages:
      
        Mar 16 16:02:29 localhost kernel: #012runnable tasks:#012 S           task   PID         tree-key ...
                                          ^^^^               ^^^^
      Clean these up by moving the newline characters into their own
      SEQ_printf invocations (see the sketch after the examples below).
      This leaves /proc/sched_debug unchanged, but brings the entire output
      into alignment when prefixed:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        [   62.410368] runnable tasks:
        [   62.410368]  S           task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
        [   62.410369] -----------------------------------------------------------------------------------------------------------
        [   62.410369]  I  kworker/u12:0     5      1932.215593       332   120         0.000000         3.621252         0.000000 0 0 /
      
      and no escaped control characters from rsyslog in /var/log/messages:
      
        Mar 16 16:15:06 localhost kernel: runnable tasks:
        Mar 16 16:15:06 localhost kernel: S           task   PID         tree-key  ...
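      
      A hedged before/after sketch of the change (illustrative only, not the
      verbatim patch to kernel/sched/debug.c):
      
      	/* before: the newline rides along with the following text */
      	SEQ_printf(m, "\nrunnable tasks:\n");
      
      	/* after: the newline gets its own invocation, so every logical
      	 * line receives its own timestamp prefix on the console */
      	SEQ_printf(m, "\n");
      	SEQ_printf(m, "runnable tasks:\n");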
      Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1521484555-8620-3-git-send-email-joe.lawrence@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Fix per-task line continuation for console output · a8c024cd
      Joe Lawrence authored
      When the SEQ_printf() macro prints to the console, it runs a simple
      printk() without KERN_CONT "continued" line printing.  The result of
      this is oddly wrapped task info, for example:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        runnable tasks:
        ...
        [   29.608611]  I
        [   29.608613]       rcu_sched     8      3252.013846      4087   120
        [   29.608614]         0.000000        29.090111         0.000000
        [   29.608615]  0 0
        [   29.608616]  /
      
      Modify SEQ_printf to use pr_cont() for the expected one-line results
      (a sketch of the macro follows the example):
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        runnable tasks:
        ...
        [  106.716329]  S        cpuhp/5    37      2006.315026        14   120         0.000000         0.496893         0.000000 0 0 /
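      
      A minimal sketch of the adjusted macro, assuming the shape it has in
      kernel/sched/debug.c (details approximate):
      
      	#define SEQ_printf(m, x...)		\
      	do {					\
      		if (m)				\
      			seq_printf(m, x);	\
      		else				\
      			pr_cont(x);		\
      	} while (0)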
      Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1521484555-8620-2-git-send-email-joe.lawrence@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/cgroup: Fix child event counting bug · c917e0f2
      Song Liu authored
      When a perf_event is attached to a parent cgroup, it should count events
      for all child cgroups:
      
         parent_group   <---- perf_event
           \
            - child_group  <---- process(es)
      
      However, in our tests, we found this perf_event cannot report reliable
      results. Here is an example case:
      
        # create cgroups
        mkdir -p /sys/fs/cgroup/p/c
        # start perf for parent group
        perf stat -e instructions -G "p"
      
        # on another console, run test process in child cgroup:
        stressapptest -s 2 -M 1000 & echo $! > /sys/fs/cgroup/p/c/cgroup.procs
      
        # after the test process is done, stopping perf in the first console shows
      
             <not counted>      instructions              p
      
      The instructions should not be reported as "<not counted>", since the
      process runs in the child cgroup.
      
      We found this is because perf_event->cgrp and cpuctx->cgrp are not
      identical, and thus perf_event->cgrp is not updated properly.
      
      This patch fixes this by updating perf_cgroup properly for ancestor
      cgroup(s).
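      
      A hedged sketch of the idea (cgroup_is_descendant() is the existing
      kernel helper; the surrounding variable names here are assumptions,
      not the literal patch):
      
      	/*
      	 * When reconciling cpuctx->cgrp with an event's cgroup, an
      	 * ancestor relationship must count as a match, so an event on
      	 * a parent cgroup also counts for tasks in child cgroups.
      	 */
      	if (cgroup_is_descendant(task_cgrp->css.cgroup,
      				 event_cgrp->css.cgroup))
      		cpuctx->cgrp = task_cgrp;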
      Reported-by: Ephraim Park <ephiepark@fb.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <jolsa@redhat.com>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20180312165943.1057894-1-songliubraving@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Disable jump labels in __exit code · 578ae447
      Josh Poimboeuf authored
      With the following commit:
      
        33352244 ("jump_label: Explicitly disable jump labels in __init code")
      
      ... we explicitly disabled jump labels in __init code, so they could be
      detected and not warned about in the following commit:
      
        dc1dd184 ("jump_label: Warn on failed jump_label patching attempt")
      
      In-kernel __exit code has the same issue.  It's never used, so it's
      freed along with the rest of initmem.  But jump label entries in __exit
      code aren't explicitly disabled, so we get the following warning when
      enabling pr_debug() in __exit code:
      
        can't patch jump_label at dmi_sysfs_exit+0x0/0x2d
        WARNING: CPU: 0 PID: 22572 at kernel/jump_label.c:376 __jump_label_update+0x9d/0xb0
      
      Fix the warning by disabling all jump labels in initmem (which includes
      both __init and __exit code).
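      
      A hedged sketch of the check in jump_label_init() (based on the
      description above; treat the exact form as an assumption):
      
      	/* anything living in initmem (__init or __exit) is disabled,
      	 * since patching it after initmem is freed would be a bug */
      	if (init_section_contains((void *)(unsigned long)iter->code, 1))
      		iter->code = 0;		/* mark the entry as dead */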
      Reported-and-tested-by: Li Wang <liwang@redhat.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: dc1dd184 ("jump_label: Warn on failed jump_label patching attempt")
      Link: http://lkml.kernel.org/r/7121e6e595374f06616c505b6e690e275c0054d1.1521483452.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait: Improve __var_waitqueue() code generation · b3fc5c9b
      Peter Zijlstra authored
      Since we fixed hash_64() to not suck, there is no need to play games to
      attempt to improve the hash value on 64-bit.
      
      Also, since we don't use the bit value for the variables, use hash_ptr()
      directly.
      
      No change in functionality.
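      
      The resulting helper is presumably along these lines (a sketch from the
      description; bit_wait_table and WAIT_TABLE_BITS are the existing names
      in kernel/sched/wait_bit.c):
      
      	wait_queue_head_t *__var_waitqueue(void *p)
      	{
      		/* hash the variable's address straight into the table */
      		return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
      	}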
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait: Remove the wait_on_atomic_t() API · 9b8cce52
      Peter Zijlstra authored
      There are no users left (everyone got converted to wait_var_event()), remove it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait: Introduce wait_var_event() · 6b2bb726
      Peter Zijlstra authored
      As a replacement for the wait_on_atomic_t() API, provide the
      wait_var_event() API.
      
      The wait_var_event() API is based on the very same hashed-waitqueue
      idea, but doesn't care about the type (atomic_t) or the specific
      condition (atomic_read() == 0). IOW, it's much more widely
      applicable/flexible.
      
      It shares all the benefits/disadvantages of a hashed-waitqueue
      approach with the existing wait_on_atomic_t/wait_on_bit() APIs.
      
      The API is modeled after the existing wait_event() API, but instead of
      taking a wait_queue_head, it takes an address. This address is hashed
      to obtain a wait_queue_head from the bit_wait_table.
      
      Similar to the wait_event() API, it takes a condition expression as its
      second argument and will wait until this expression becomes true.
      
      The following are (mostly) identical replacements:
      
      	wait_on_atomic_t(&my_atomic, atomic_t_wait, TASK_UNINTERRUPTIBLE);
      	wake_up_atomic_t(&my_atomic);
      
      	wait_var_event(&my_atomic, !atomic_read(&my_atomic));
      	wake_up_var(&my_atomic);
      
      The only difference is that wake_up_var() is an unconditional wakeup
      and doesn't check the previously hard-coded (atomic_read() == 0)
      condition here. This is of little consequence, since most callers are
      already conditional on atomic_dec_and_test() and the ones that are
      not are trivial to make so.
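      
      A hedged usage sketch under the common atomic_dec_and_test() pattern
      (the variable and function names here are illustrative):
      
      	static atomic_t users = ATOMIC_INIT(1);
      
      	static void put_user_ref(void)
      	{
      		/* conditional on the atomic reaching zero, as noted above */
      		if (atomic_dec_and_test(&users))
      			wake_up_var(&users);	/* unconditional wakeup */
      	}
      
      	static void wait_for_users_gone(void)
      	{
      		wait_var_event(&users, !atomic_read(&users));
      	}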
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Update util_est only on util_avg updates · d519329f
      Patrick Bellasi authored
      The estimated utilization of a task is currently updated every time the
      task is dequeued. However, to keep overheads under control, PELT signals
      are effectively updated at most once every 1ms.
      
      Thus, for really short running tasks, it can happen that their util_avg
      value has not been updated since their last enqueue.  If such tasks are
      also frequently running tasks (e.g. the kind of workload generated by
      hackbench) it can also happen that their util_avg is updated only every
      few activations.
      
      This means that updating util_est at every dequeue potentially introduces
      unnecessary overhead, and it's also conceptually wrong if the util_avg
      signal has never been updated during a task activation.
      
      Let's introduce a throttling mechanism on a task's util_est updates
      to sync them with util_avg updates. To make the solution memory
      efficient, both in terms of space and load/store operations, we encode a
      synchronization flag into the LSB of util_est.enqueued.
      This makes util_est an even-values-only metric, which is still
      considered good enough for its purpose.
      The synchronization bit is (re)set by __update_load_avg_se() once the
      PELT signal of a task has been updated during its last activation.
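      
      A hedged sketch of the LSB encoding (UTIL_AVG_UNCHANGED is the flag
      name used by the patch; the surrounding code is illustrative):
      
      	#define UTIL_AVG_UNCHANGED 0x1
      
      	/* called once __update_load_avg_se() has refreshed the PELT signal */
      	static inline void cfs_se_util_change(struct sched_avg *avg)
      	{
      		unsigned int enqueued = avg->util_est.enqueued;
      
      		/* clear the flag: util_avg is in sync for this activation */
      		enqueued &= ~UTIL_AVG_UNCHANGED;
      		WRITE_ONCE(avg->util_est.enqueued, enqueued);
      	}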
      
      Such a throttling mechanism keeps util_est overheads under control in
      the wakeup hot path, making it suitable to enable even on systems
      running high-intensity workloads.
      Thus, the utilization-estimation scheduler feature is now switched on
      by default.
      Suggested-by: Chris Redpath <chris.redpath@arm.com>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-5-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/cpufreq/schedutil: Use util_est for OPP selection · a07630b8
      Patrick Bellasi authored
      When schedutil looks at the CPU utilization, the current PELT value for
      that CPU is returned straight away. In certain scenarios this can have
      undesired side effects and delays in frequency selection.
      
      For example, since the task utilization is decayed at wakeup time, a
      long sleeping big task that is newly enqueued does not immediately add
      a significant contribution to the target CPU. This introduces some
      latency before schedutil will be able to detect the best frequency
      required by that task.
      
      Moreover, the PELT signal build-up time is a function of the current
      frequency, because of the scale-invariant load tracking support. Thus,
      starting from a lower frequency, the utilization build-up time will
      increase even more and further delay the selection of the actual
      frequency that best serves the task's requirements.
      
      In order to reduce these kinds of latencies, we integrate the usage
      of the CPU's estimated utilization into the sugov_get_util() function.
      
      This allows us to properly consider the expected utilization of a CPU
      which, for example, has just got a big task running after a long sleep
      period. Ultimately this allows the best frequency to be selected to run
      a task right after its wake-up.
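      
      A hedged sketch of the utilization schedutil ends up seeing (field
      names assumed from the util_est patches, not the verbatim kernel code):
      
      	static unsigned long cpu_util_est_sketch(struct rq *rq)
      	{
      		/* the larger of the PELT average and the estimate wins */
      		return max_t(unsigned long,
      			     rq->cfs.avg.util_avg,
      			     rq->cfs.avg.util_est.enqueued);
      	}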
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-4-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Use util_est in LB and WU paths · f9be3e59
      Patrick Bellasi authored
      When the scheduler looks at the CPU utilization, the current PELT value
      for a CPU is returned straight away. In certain scenarios this can have
      undesired side effects on task placement.
      
      For example, since the task utilization is decayed at wakeup time, when
      a long sleeping big task is enqueued it does not immediately add a
      significant contribution to the target CPU.
      As a result we generate a race condition where other tasks can be placed
      on the same CPU while it is still considered relatively empty.
      
      In order to reduce these kinds of race conditions, this patch introduces
      the required support to integrate the usage of the CPU's estimated
      utilization in the wakeup path, via cpu_util_wake(), as well as in the
      load-balance path, via cpu_util(), which is used by update_sg_lb_stats().
      
      The estimated utilization of a CPU is defined to be the maximum between
      its PELT utilization and the sum of the estimated utilizations (at
      previous dequeue time) of all the tasks currently RUNNABLE on that CPU.
      This properly represents the spare capacity of a CPU which, for example,
      has just got a big task running after a long sleep period.
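      
      For the wakeup path, a simplified sketch of the idea (the names and the
      exact discounting are assumptions, not the literal kernel logic):
      
      	static unsigned long cpu_util_wake_sketch(struct rq *rq,
      						  struct task_struct *p)
      	{
      		unsigned long util = rq->cfs.avg.util_avg;
      
      		/* p may still be accounted in this CPU's util_avg */
      		util -= min(util, task_util(p));
      
      		/* util_est already dropped p's part at its last dequeue */
      		return max_t(unsigned long, util,
      			     rq->cfs.avg.util_est.enqueued);
      	}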
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-3-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Add util_est on top of PELT · 7f65ea42
      Patrick Bellasi authored
      The util_avg signal computed by PELT is too variable for some use-cases.
      For example, a big task waking up after a long sleep period will have its
      utilization almost completely decayed. This introduces some latency before
      schedutil will be able to pick the best frequency to run a task.
      
      The same issue can affect task placement. Indeed, since the task
      utilization is already decayed at wakeup, when the task is enqueued on
      a CPU, a CPU running a big task can be temporarily represented as
      almost empty. This leads to a race condition where other tasks can
      potentially be allocated on a CPU which has just started to run a big
      task that slept for a relatively long period.
      
      Moreover, the PELT utilization of a task can be updated every [ms], thus
      making it a continuously changing value for certain longer running
      tasks. This means that the instantaneous PELT utilization of a RUNNING
      task is not really meaningful to properly support scheduler decisions.
      
      For all these reasons, a more stable signal can do a better job of
      representing the expected/estimated utilization of a task/cfs_rq.
      Such a signal can be easily created on top of PELT by still using it as
      an estimator which produces values to be aggregated on meaningful
      events.
      
      This patch adds a simple implementation of util_est, a new signal built on
      top of PELT's util_avg where:
      
          util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))
      
      This allows us to remember how big a task has been reported to be by
      PELT in its previous activations, via f(task::util_avg@dequeue), which
      is the new _task_util_est(struct task_struct*) function added by this
      patch.
      
      If a task changes its behavior and runs longer in a new activation,
      after a certain time its util_est will just track the original PELT
      signal (i.e. task::util_avg).
      
      The estimated utilization of a cfs_rq is defined only for root ones.
      That's because the only sensible consumers of this signal are the
      scheduler and schedutil when looking for the overall CPU utilization
      due to FAIR tasks.
      
      For this reason, the estimated utilization of a root cfs_rq is simply
      defined as:
      
          util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)
      
      where:
      
          cfs_rq::util_est::enqueued = sum(_task_util_est(task))
                                       for each RUNNABLE task on that root cfs_rq
      
      It's worth noting that the estimated utilization is tracked only for
      objects of interest, specifically:
      
       - Tasks: to better support task placement decisions
       - root cfs_rqs: to better support both task placement decisions as
                       well as frequency selection
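      
      Putting the above together, the per-entity state is presumably
      something like this sketch (field layout inferred from the
      description; treat it as an assumption):
      
      	struct util_est {
      		unsigned int	enqueued;	/* estimate while enqueued; on a
      						 * root cfs_rq: the sum of
      						 * _task_util_est() over its
      						 * RUNNABLE tasks */
      		unsigned int	ewma;		/* f(util_avg@dequeue): a moving
      						 * average over past activations */
      	};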
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>