1. 22 Apr 2020, 3 commits
  2. 18 Mar 2020, 2 commits
• io-wq: small threadpool implementation for io_uring · 8a308e54
Committed by Jens Axboe
      commit 771b53d033e8663abdf59704806aa856b236dcdb upstream.
      
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the life time of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
io-wq hooks into the scheduler's schedule in/out path just like workqueue
does, and uses that to drive the need for more or fewer workers.
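
As an aside, a minimal standalone sketch of the hashed-insertion idea is
shown below: work items are keyed (for example by the inode being written)
and appended to a per-bucket FIFO, so same-file buffered writes stay
serialized on one worker while unrelated writes can run in parallel. The
type and function names are illustrative, not the actual io-wq API.

/*
 * Illustrative sketch (not the io-wq code itself): hashed insertion keyed
 * on the write target serializes same-key work while keeping different
 * keys independent.
 */
#include <stdint.h>
#include <stddef.h>

#define SKETCH_NR_HASH_BUCKETS	64

struct io_work {
	void (*fn)(struct io_work *);
	const void *hash_key;		/* e.g. the inode being written to */
	struct io_work *next;		/* per-bucket FIFO link */
};

struct io_wq_sketch {
	/* one FIFO per bucket; a bucket is drained by a single worker */
	struct io_work *buckets[SKETCH_NR_HASH_BUCKETS];
};

static unsigned int io_work_bucket(const void *key)
{
	/* cheap pointer hash; the real implementation can differ */
	uintptr_t v = (uintptr_t)key;

	return (unsigned int)((v >> 4) ^ (v >> 12)) % SKETCH_NR_HASH_BUCKETS;
}

static void io_wq_enqueue_hashed_sketch(struct io_wq_sketch *wq,
					struct io_work *work)
{
	unsigned int b = io_work_bucket(work->hash_key);
	struct io_work **p = &wq->buckets[b];

	/* append to the bucket tail so same-key work stays ordered */
	while (*p)
		p = &(*p)->next;
	work->next = NULL;
	*p = work;
}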
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: Cherry-pick allow_kernel_signal() from upstream commit 33da8e7c814f]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      8a308e54
• sched/core, workqueues: Distangle worker accounting from rq lock · 143495ca
Committed by Thomas Gleixner
      commit 6d25be5782e482eb93e3de0c94d0a517879377d0 upstream.
      
      The worker accounting for CPU bound workers is plugged into the core
      scheduler code and the wakeup code. This is not a hard requirement and
      can be avoided by keeping track of the state in the workqueue code
      itself.
      
      Keep track of the sleeping state in the worker itself and call the
      notifier before entering the core scheduler. There might be false
      positives when the task is woken between that call and actually
      scheduling, but that's not really different from scheduling and being
      woken immediately after switching away. When nr_running is updated when
the task is returning from schedule(), then it is later compared when it
      is done from ttwu().
      
      [ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]
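
For orientation, the resulting hooks in kernel/sched/core.c look roughly
like the sketch below (simplified and abridged, not a verbatim copy of the
upstream hunk): the workqueue code is notified before __schedule() takes
rq->lock, and again once the worker runs after coming back.

/* Simplified, illustrative sketch of the resulting hooks. */
static inline void sched_submit_work(struct task_struct *tsk)
{
	if (!tsk->state)
		return;

	/*
	 * Notify the workqueue code that this worker is going to sleep
	 * *before* __schedule() takes rq->lock.  Preemption is disabled
	 * so a wakeup cannot trigger a nested schedule() in between.
	 */
	if (tsk->flags & PF_WQ_WORKER) {
		preempt_disable();
		wq_worker_sleeping(tsk);
		preempt_enable_no_resched();
	}
}

static void sched_update_worker(struct task_struct *tsk)
{
	/* re-account the worker as running once it returns from schedule() */
	if (tsk->flags & PF_WQ_WORKER)
		wq_worker_running(tsk);
}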
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      143495ca
3. 17 Jan 2020, 5 commits
• alinux: psi: using cpuacct_cgrp_id under CONFIG_CGROUP_CPUACCT · 1bd8a72b
Committed by Joseph Qi
      This is to fix the build error if CONFIG_CGROUP_CPUACCT is not enabled.
        kernel/sched/psi.c: In function 'iterate_groups':
        kernel/sched/psi.c:732:31: error: 'cpuacct_cgrp_id' undeclared (first use in this function); did you mean 'cpuacct_charge'?
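
The fix presumably boils down to compiling the cpuacct_cgrp_id reference
only when CONFIG_CGROUP_CPUACCT is set; one plausible shape for such a
guard is sketched below (the surrounding call and fallback are purely
illustrative, not the exact hunk).

#ifdef CONFIG_CGROUP_CPUACCT
	cgroup = task_cgroup(task, cpuacct_cgrp_id);	/* only valid with cpuacct built in */
#else
	cgroup = NULL;	/* no cpuacct controller: skip cgroup v1 psi accounting */
#endif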
Reported-by: kbuild test robot <lkp@intel.com>
      Fixes: 1f49a738 ("alinux: psi: Support PSI under cgroup v1")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      1bd8a72b
• alinux: sched/fair: use static load in wake_affine_weight · d2440c99
Committed by Huaixin Chang
For a long time, runnable cpu load has been used when selecting a task's rq
on wakeup. Recent testing has shown that for workloads with a large number
of short-running tasks and nearly full cpu utilization, static load is more
helpful.
      
In our e2e tests, the runnable load avg of java threads ranges from less
than 10 to as large as 362, even though these java threads are no different
from each other and should be treated the same way. After switching to
static load, QPS improvement has been seen in multiple test cases.
      
A new sched feature, WA_STATIC_WEIGHT, is introduced here to control this.
Echo WA_STATIC_WEIGHT into /sys/kernel/debug/sched_features to use static
load in wake_affine_weight(), and NO_WA_STATIC_WEIGHT to turn it off. The
feature is off by default.
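
A hedged sketch of what such a switch could look like inside
wake_affine_weight() is shown below; the helper names (cpu_static_load(),
cpu_runnable_load()) are hypothetical placeholders, not the actual patch.

/* Illustrative only: pick the load metric based on the sched feature. */
static unsigned long wa_weight_load(int cpu)
{
	if (sched_feat(WA_STATIC_WEIGHT))
		return cpu_static_load(cpu);	/* sum of task se->load.weight */

	return cpu_runnable_load(cpu);		/* runnable load average */
}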
      
      Test is done on the following hardware:
      
      4 threads Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
      
      In tests with 120 threads and sql loglevel configured to info:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	33170.63                34614.95 (+4.35%)
      
      In tests with 160 threads and sql loglevel configured to info:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	35888.71                38247.20 (+6.57%)
      
      In tests with 160 threads and sql loglevel configured to warn:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	39118.72                39698.72 (+1.48%)
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      d2440c99
• alinux: hotfix: Add Cloud Kernel hotfix enhancement · f94e5b1a
Committed by Xunlei Pang
We reserve some fields beforehand in core structures that are prone to
change, so that we are not hurt when extra fields have to be added for a
hotfix. This increases the hotfix success rate, and we can even hot-add
features with this enhancement (a sketch of the reservation pattern follows
the structure list below).

After reserving, cache behavior normally does not matter, as the reserved
fields (usually at the tail) are not accessed at all.
      
The following structures are currently involved:
          MM:
          struct zone
          struct pglist_data
          struct mm_struct
          struct vm_area_struct
          struct mem_cgroup
          struct writeback_control
      
          Block:
          struct gendisk
          struct backing_dev_info
          struct bio
          struct queue_limits
          struct request_queue
          struct blkcg
          struct blkcg_policy
          struct blk_mq_hw_ctx
          struct blk_mq_tag_set
          struct blk_mq_queue_data
          struct blk_mq_ops
          struct elevator_mq_ops
          struct inode
          struct dentry
          struct address_space
          struct block_device
          struct hd_struct
          struct bio_set
      
          Network:
          struct sk_buff
          struct sock
          struct net_device_ops
          struct xt_target
          struct dst_entry
          struct dst_ops
          struct fib_rule
      
          Scheduler:
          struct task_struct
          struct cfs_rq
          struct rq
          struct sched_statistics
          struct sched_entity
          struct signal_struct
          struct task_group
          struct cpuacct
      
          cgroup:
          struct cgroup_root
          struct cgroup_subsys_state
          struct cgroup_subsys
          struct css_set
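
As mentioned above, a sketch of the reservation pattern might look like the
following; the macro name is hypothetical and the real Cloud Kernel helpers
may differ.

/* Illustrative only: pad structures with named placeholders so a later
 * hotfix can turn a slot into a real field without changing the layout
 * or size of the structure. */
#define CK_RESERVE_FIELD(n)	unsigned long ck_reserved##n

struct example_core_struct {
	int		existing_field;
	/* ... existing members ... */

	CK_RESERVE_FIELD(1);
	CK_RESERVE_FIELD(2);
	CK_RESERVE_FIELD(3);
	CK_RESERVE_FIELD(4);
};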
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
[ caspar: use SPDX-License-Identifier ]
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
      f94e5b1a
• alinux: introduce psi_v1 boot parameter · dc159a61
Committed by Joseph Qi
Instead of using the static Kconfig option CONFIG_PSI_CGROUP_V1, we
introduce a boot parameter, psi_v1, to enable PSI cgroup v1 support. It is
disabled by default, which means that when only the psi=1 boot parameter is
passed, PSI is supported on cgroup v2 alone.
This keeps it consistent with other cgroup v1 features such as cgroup
writeback v1 (cgwb_v1).
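
A boot-parameter switch of this kind typically follows the pattern sketched
below, modeled on the existing psi= handling; the variable name here is an
assumption.

/* Illustrative sketch of a psi_v1= boot parameter handler. */
static bool psi_v1_enable __read_mostly;

static int __init setup_psi_v1(char *str)
{
	/* __setup handlers return 1 when the option has been consumed */
	return kstrtobool(str, &psi_v1_enable) == 0;
}
__setup("psi_v1=", setup_psi_v1);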
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
      dc159a61
• alinux: psi: Support PSI under cgroup v1 · 1f49a738
Committed by Xunlei Pang
      Export "cpu|io|memory.pressure" under cgroup v1 "cpuacct" subsystem.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      1f49a738
4. 27 Dec 2019, 23 commits
  5. 13 Dec 2019, 2 commits
• sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision · 742f2319
Committed by Xuewei Zhang
      commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 upstream.
      
      The quota/period ratio is used to ensure a child task group won't get
      more bandwidth than the parent task group, and is calculated as:
      
        normalized_cfs_quota() = [(quota_us << 20) / period_us]
      
If the quota/period ratio is changed during this scaling due to precision
loss, it causes an inconsistency between parent and child task groups.

See the example below:
      
      A userspace container manager (kubelet) does three operations:
      
       1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
       2) Create a few children cgroups.
       3) Set quota to 1,000us and period to 10,000us on a child cgroup.
      
      These operations are expected to succeed. However, if the scaling of
      147/128 happens before step 3, quota and period of the parent cgroup
      will be changed:
      
        new_quota: 1148437ns,   1148us
       new_period: 11484375ns, 11484us
      
      And when step 3 comes in, the ratio of the child cgroup will be
      104857, which will be larger than the parent cgroup ratio (104821),
      and will fail.
      
      Scaling them by a factor of 2 will fix the problem.
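
The arithmetic can be reproduced with the small standalone program below,
using the numbers from this changelog; it shows how the 147/128 scaling
shifts the normalized ratio while a clean factor of 2 preserves it.

/* Standalone demonstration of the precision problem described above. */
#include <stdio.h>
#include <stdint.h>

static uint64_t normalized(uint64_t quota_us, uint64_t period_us)
{
	return (quota_us << 20) / period_us;
}

int main(void)
{
	/* parent starts at 1000us quota over a 10000us period */
	printf("before:  %llu\n",
	       (unsigned long long)normalized(1000, 10000));	/* 104857 */
	/* after a 147/128 scaling with rounding, per the changelog */
	printf("147/128: %llu\n",
	       (unsigned long long)normalized(1148, 11484));	/* 104821 */
	/* scaling both values by exactly 2 keeps the ratio intact */
	printf("x2:      %llu\n",
	       (unsigned long long)normalized(2000, 20000));	/* 104857 */
	return 0;
}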
Tested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Xuewei Zhang <xueweiz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Phil Auld <pauld@redhat.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      742f2319
• sched/core: Avoid spurious lock dependencies · 870083b6
Committed by Peter Zijlstra
      [ Upstream commit ff51ff84d82aea5a889b85f2b9fb3aa2b8691668 ]
      
While seemingly harmless, __sched_fork() does hrtimer_init(), which, with
DEBUG_OBJECTS enabled, can end up doing allocations.
      
      This then results in the following lock order:
      
        rq->lock
          zone->lock.rlock
            batched_entropy_u64.lock
      
      Which in turn causes deadlocks when we do wakeups while holding that
      batched_entropy lock -- as the random code does.
      
      Solve this by moving __sched_fork() out from under rq->lock. This is
      safe because nothing there relies on rq->lock, as also evident from the
      other __sched_fork() callsite.
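
The shape of the fix in init_idle() is roughly the following (simplified,
not the verbatim upstream hunk): __sched_fork() now runs before the locks
are taken, so any DEBUG_OBJECTS allocation it triggers stays outside the
rq->lock dependency chain.

void init_idle(struct task_struct *idle, int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags;

	__sched_fork(0, idle);		/* moved out from under the locks */

	raw_spin_lock_irqsave(&idle->pi_lock, flags);
	raw_spin_lock(&rq->lock);

	/* ... the rest of the idle task setup is unchanged ... */

	raw_spin_unlock(&rq->lock);
	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
}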
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: bigeasy@linutronix.de
      Cc: cl@linux.com
      Cc: keescook@chromium.org
      Cc: penberg@kernel.org
      Cc: rientjes@google.com
      Cc: thgarnie@google.com
      Cc: tytso@mit.edu
      Cc: will@kernel.org
      Fixes: b7d5dc21072c ("random: add a spinlock_t to struct batched_entropy")
Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      870083b6
6. 01 Dec 2019, 2 commits
  7. 21 Nov 2019, 1 commit
  8. 13 Nov 2019, 2 commits
• sched/fair: Fix -Wunused-but-set-variable warnings · e9c0fc4a
Committed by Qian Cai
      commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 upstream.
      
      Commit:
      
         de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
      
      introduced a few compilation warnings:
      
        kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
        kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
        kernel/sched/fair.c: In function 'start_cfs_bandwidth':
        kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]
      
Also, __refill_cfs_bandwidth_runtime() no longer updates the
expiration time, so fix the comments accordingly.
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pauld@redhat.com
      Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      e9c0fc4a
• sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 502bd151
Committed by Dave Chiluk
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive, non-cpu-bound
applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.
      
This has been root-caused to cpu-local run queues being allocated per-cpu
bandwidth slices and then not fully using those slices within the period,
at which point the slice and quota expire. This expiration of unused
slices results in applications not being able to utilize the quota for
which they are allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
      	/* extend local deadline, drift is bounded above by 2 ticks */
      	cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, and has recently been fixed
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
      that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. This allows threads on runqueues that do not
      use much CPU to continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. However, this helps prevent the
      above condition of hitting throttling while also not fully utilizing
      your cpu quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
      more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines. In the case of an artificial testcase (10ms/100ms of quota on
      80 CPU machine), this commit resulted in almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
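
For context, the expiry path being deleted looked roughly like the
reconstruction below, loosely modeled on the snippet quoted above; it is
illustrative rather than the exact upstream code.

/* Illustrative reconstruction of the expiry logic this commit removes. */
static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);

	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
		/* extend local deadline, drift is bounded above by 2 ticks */
		cfs_rq->runtime_expires += TICK_NSEC;
	} else {
		/* period really ended: throw away the unused local slice */
		cfs_rq->runtime_remaining = 0;
	}
}

After this commit the function above and the runtime_expires bookkeeping are
gone entirely: runtime already handed out to a per-cpu cfs_rq simply outlives
the period that granted it, bounded by min_cfs_rq_runtime per cpu as noted
above.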
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      502bd151