1. 24 April, 2020 (7 commits)
  2. 23 April, 2020 (1 commit)
    • mm, compaction: capture a page under direct compaction · 35d915be
      Authored by Mel Gorman
      to #26255339
      
      commit 5e1f0f098b4649fad53011246bcaeff011ffdf5d upstream
      
      Compaction is inherently race-prone as a suitable page freed during
      compaction can be allocated by any parallel task.  This patch uses a
      capture_control structure to isolate a page immediately when it is freed
      by a direct compactor in the slow path of the page allocator.  The
      intent is to avoid redundant scanning.
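      The capture path can be sketched roughly as follows (simplified from the
      upstream patch; locking and the migratetype checks are elided):

          /* Carried by a direct compactor in the slow allocation path. */
          struct capture_control {
          	struct compact_control *cc;	/* the compactor's context */
          	struct page *page;		/* page captured at free time, if any */
          };

          /*
           * Called from the page freeing path.  If the current task is a direct
           * compactor waiting for a page of exactly this order, hand the page to
           * it instead of returning it to the free lists, where a parallel
           * allocation could take it first.
           */
          static inline bool
          compaction_capture(struct capture_control *capc, struct page *page, int order)
          {
          	if (!capc || order != capc->cc->order)
          		return false;

          	capc->page = page;	/* compactor stops scanning and uses this page */
          	return true;
          }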
      
                                           5.0.0-rc1              5.0.0-rc1
                                     selective-v3r17          capture-v3r19
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
      Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
      Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
      Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
      Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
      Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
      Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
      Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)
      
      Latency is only moderately affected but the devil is in the details.  A
      closer examination indicates that base page fault latency is reduced but
      latency of huge pages is increased as it takes greater care to succeed.
      Part of the "problem" is that allocation success rates are close to 100%
      even when under pressure and compaction gets harder.
      
                                      5.0.0-rc1              5.0.0-rc1
                                selective-v3r17          capture-v3r19
      Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
      Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
      Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
      Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
      Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
      Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
      Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
      Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)
      
      Scan rates are reduced as expected: by 6% for the migration scanner and
      29% for the free scanner, indicating that there is less redundant work.
      
      Compaction migrate scanned    20815362    19573286
      Compaction free scanned       16352612    11510663
      
      [mgorman@techsingularity.net: remove redundant check]
        Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
      Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
  3. 22 April, 2020 (3 commits)
  4. 18 March, 2020 (2 commits)
    • io-wq: small threadpool implementation for io_uring · 8a308e54
      Authored by Jens Axboe
      commit 771b53d033e8663abdf59704806aa856b236dcdb upstream.
      
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the life time of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks up to the scheduler schedule in/out just like workqueue
      and uses that to drive the need for more/less workers.
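      The hashed work insertion mentioned above can be pictured roughly as
      below. This is an illustrative sketch only; the structure and function
      names are hypothetical, not the actual io-wq API:

          #include <linux/list.h>

          #define IO_WQ_NR_HASH_BUCKETS	64

          struct io_work {
          	struct list_head list;
          	void (*func)(struct io_work *work);
          	unsigned int hash;		/* e.g. derived from the target inode */
          };

          struct io_wq_hash_bucket {
          	struct list_head work_list;	/* same-hash items run in order */
          };

          /*
           * Works hashed to the same bucket (e.g. buffered writes to one file)
           * execute serially; different buckets may run on separate workers, so
           * one slow file does not cap the concurrency of everyone else.
           */
          static void enqueue_hashed(struct io_wq_hash_bucket *buckets,
          			   struct io_work *work)
          {
          	struct io_wq_hash_bucket *b;

          	b = &buckets[work->hash % IO_WQ_NR_HASH_BUCKETS];
          	list_add_tail(&work->list, &b->work_list);
          }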
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      [Joseph: Cherry-pick allow_kernel_signal() from upstream commit 33da8e7c814f]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • sched/core, workqueues: Distangle worker accounting from rq lock · 143495ca
      Authored by Thomas Gleixner
      commit 6d25be5782e482eb93e3de0c94d0a517879377d0 upstream.
      
      The worker accounting for CPU bound workers is plugged into the core
      scheduler code and the wakeup code. This is not a hard requirement and
      can be avoided by keeping track of the state in the workqueue code
      itself.
      
      Keep track of the sleeping state in the worker itself and call the
      notifier before entering the core scheduler. There might be false
      positives when the task is woken between that call and actually
      scheduling, but that's not really different from scheduling and being
      woken immediately after switching away. When nr_running is updated when
      the task is returning from schedule() then it is later compared when it
      is done from ttwu().
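      A simplified sketch of the new call site, as I read the upstream patch
      (error handling and the block-plug flush are elided):

          /* Invoked on the way into __schedule(), outside the rq lock. */
          static inline void sched_submit_work(struct task_struct *tsk)
          {
          	if (!tsk->state)
          		return;

          	if (tsk->flags & PF_WQ_WORKER) {
          		/*
          		 * Disable preemption so a wakeup issued by
          		 * wq_worker_sleeping() cannot recurse into schedule().
          		 */
          		preempt_disable();
          		wq_worker_sleeping(tsk);
          		preempt_enable_no_resched();
          	}
          }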
      
      [ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
  5. 17 January, 2020 (5 commits)
    • alinux: psi: using cpuacct_cgrp_id under CONFIG_CGROUP_CPUACCT · 1bd8a72b
      Authored by Joseph Qi
      This is to fix the build error when CONFIG_CGROUP_CPUACCT is not enabled:
        kernel/sched/psi.c: In function 'iterate_groups':
        kernel/sched/psi.c:732:31: error: 'cpuacct_cgrp_id' undeclared (first use in this function); did you mean 'cpuacct_charge'?
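      A minimal sketch of the fix (the assumed shape of the lookup; the real
      iterate_groups() in kernel/sched/psi.c is larger, and the helper name
      below is hypothetical). The point is that any reference to
      cpuacct_cgrp_id must be compiled out when CONFIG_CGROUP_CPUACCT=n,
      because that id only exists when the controller is built in:

          static struct cgroup *psi_v1_task_cgroup(struct task_struct *task)
          {
          #ifdef CONFIG_CGROUP_CPUACCT
          	/* caller holds rcu_read_lock() */
          	return task_cgroup(task, cpuacct_cgrp_id);
          #else
          	return NULL;	/* no cpuacct hierarchy to walk */
          #endif
          }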
      Reported-by: kbuild test robot <lkp@intel.com>
      Fixes: 1f49a738 ("alinux: psi: Support PSI under cgroup v1")
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • alinux: sched/fair: use static load in wake_affine_weight · d2440c99
      Authored by Huaixin Chang
      For a long time the runnable CPU load has been used to select the task's
      rq when waking up tasks. Recent tests have shown that for workloads with
      a large number of short-running tasks and near-full CPU utilization,
      static load is more helpful.
      
      In our e2e tests, the runnable load average of Java threads ranges from
      less than 10 to as large as 362, even though these Java threads are no
      different from each other and should be treated in the same way. After
      switching to static load, a QPS improvement has been seen in multiple
      test cases.
      
      A new sched feature, WA_STATIC_WEIGHT, is introduced to control this.
      Echo WA_STATIC_WEIGHT to /sys/kernel/debug/sched_features to turn static
      load in wake_affine_weight on, and NO_WA_STATIC_WEIGHT to turn it off.
      The feature is kept off by default.
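      A minimal sketch of the idea, assuming a helper like the hypothetical
      task_wake_weight() below inside wake_affine_weight() (the real code in
      kernel/sched/fair.c compares effective loads of the waking and previous
      CPUs):

          SCHED_FEAT(WA_STATIC_WEIGHT, false)	/* kernel/sched/features.h, off by default */

          static unsigned long task_wake_weight(struct task_struct *p)
          {
          	if (sched_feat(WA_STATIC_WEIGHT))
          		return scale_load_down(p->se.load.weight);	/* static weight */

          	return task_h_load(p);		/* default: runnable-load based */
          }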
      
      Test is done on the following hardware:
      
      4 threads Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
      
      In tests with 120 threads and sql loglevel configured to info:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	33170.63                34614.95 (+4.35%)
      
      In tests with 160 threads and sql loglevel configured to info:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	35888.71                38247.20 (+6.57%)
      
      In tests with 160 threads and sql loglevel configured to warn:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	39118.72                39698.72 (+1.48%)
      Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
      Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • alinux: hotfix: Add Cloud Kernel hotfix enhancement · f94e5b1a
      Authored by Xunlei Pang
      We reserve some fields beforehand in core structures that are prone to
      change, so that we are not hurt when extra fields have to be added for a
      hotfix, thereby increasing the hotfix success rate; with this enhancement
      we can even hot-add features.

      After reserving, cache behaviour normally does not matter, as the
      reserved fields (usually placed at the tail) are not accessed at all.
      
      The following structures are currently involved (a sketch of the
      reservation pattern follows the list):
          MM:
          struct zone
          struct pglist_data
          struct mm_struct
          struct vm_area_struct
          struct mem_cgroup
          struct writeback_control
      
          Block:
          struct gendisk
          struct backing_dev_info
          struct bio
          struct queue_limits
          struct request_queue
          struct blkcg
          struct blkcg_policy
          struct blk_mq_hw_ctx
          struct blk_mq_tag_set
          struct blk_mq_queue_data
          struct blk_mq_ops
          struct elevator_mq_ops
          struct inode
          struct dentry
          struct address_space
          struct block_device
          struct hd_struct
          struct bio_set
      
          Network:
          struct sk_buff
          struct sock
          struct net_device_ops
          struct xt_target
          struct dst_entry
          struct dst_ops
          struct fib_rule
      
          Scheduler:
          struct task_struct
          struct cfs_rq
          struct rq
          struct sched_statistics
          struct sched_entity
          struct signal_struct
          struct task_group
          struct cpuacct
      
          cgroup:
          struct cgroup_root
          struct cgroup_subsys_state
          struct cgroup_subsys
          struct css_set
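
      The reservation pattern can be pictured roughly as below. This is an
      illustrative sketch only: the macro and field names are hypothetical
      (the actual tree may spell them differently); only the placement at the
      structure tail matters.

          /*
           * Reserved slots are appended at the tail of each structure so they
           * stay out of hot cache lines until a hotfix needs to claim one.
           */
          #define KABI_RESERVE(n)	unsigned long kabi_reserved##n

          struct example_struct {
          	/* ... existing fields ... */

          	KABI_RESERVE(1);
          	KABI_RESERVE(2);
          	KABI_RESERVE(3);
          	KABI_RESERVE(4);
          };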
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      [ caspar: use SPDX-License-Identifier ]
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alinux: introduce psi_v1 boot parameter · dc159a61
      Authored by Joseph Qi
      Instead of using the static Kconfig option CONFIG_PSI_CGROUP_V1, we
      introduce a boot parameter, psi_v1, to enable PSI cgroup v1 support. It
      is disabled by default, which means that when only the psi=1 boot
      parameter is passed, PSI is supported under cgroup v2 alone.
      This keeps it consistent with other cgroup v1 features such as cgroup
      writeback v1 (cgwb_v1).
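      A minimal sketch of how such a parameter could be parsed (assumed shape,
      not necessarily the exact patch):

          static bool psi_v1_enable __read_mostly;

          static int __init setup_psi_v1(char *str)
          {
          	return kstrtobool(str, &psi_v1_enable) == 0;
          }
          __setup("psi_v1=", setup_psi_v1);

      Booting with, e.g., psi=1 psi_v1=1 would then enable the v1 pressure
      files as well (the exact syntax depends on how the parameter is parsed).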
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • alinux: psi: Support PSI under cgroup v1 · 1f49a738
      Authored by Xunlei Pang
      Export "cpu|io|memory.pressure" under cgroup v1 "cpuacct" subsystem.
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
  6. 27 December, 2019 (22 commits)