1. 08 December 2019: 1 commit
    • alios: mm: memcontrol: support background async page reclaim · 6b2ef082
      Authored by Yang Shi
      Currently, when memory usage exceeds the memory cgroup limit, the memory
      cgroup can only do synchronous direct reclaim.  This may incur unexpected
      stalls in applications which are sensitive to latency.  Introduce a
      background async page reclaim mechanism, like what kswapd does.
      
      Define a memcg memory usage water mark by introducing a wmark_ratio
      interface, which ranges from 0 to 100 and represents a percentage of the
      max limit.  wmark_high is calculated as (max * wmark_ratio / 100), and
      wmark_low is (wmark_high - wmark_high >> 8), which is an empirical value.
      If wmark_ratio is 0, the water mark is disabled and both wmark_low and
      wmark_high are max, which is the default.
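      
      As a minimal sketch of the watermark math described above (the helper name
      and signature here are illustrative, not the ones used in the patch):
      
      	static void memcg_calc_wmarks(unsigned long max, unsigned int wmark_ratio,
      				      unsigned long *wmark_low, unsigned long *wmark_high)
      	{
      		if (wmark_ratio == 0) {
      			/* Water mark disabled: both marks sit at the max limit. */
      			*wmark_high = max;
      			*wmark_low = max;
      			return;
      		}
      		*wmark_high = max * wmark_ratio / 100;
      		/* wmark_low trails wmark_high by the empirical wmark_high >> 8. */
      		*wmark_low = *wmark_high - (*wmark_high >> 8);
      	}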
      
      If wmark_ratio is set, then when charging a page, if usage is greater than
      wmark_high (meaning the available memory of the memcg is low), a work item
      is scheduled to do background page reclaim until memory usage is reduced
      to wmark_low, if possible.
      
      Define a dedicated unbound workqueue for scheduling water mark reclaim
      work items.
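      
      A rough sketch of the charge-path trigger and the unbound workqueue; the
      memcg field and function names here are illustrative, while the real patch
      wires this into the memcg charge path and reclaims down to wmark_low from
      the work function:
      
      	static struct workqueue_struct *memcg_wmark_wq;
      
      	/* Created once at init time; unbound so reclaim can run on any CPU. */
      	static int memcg_wmark_wq_init(void)
      	{
      		memcg_wmark_wq = alloc_workqueue("memcg_wmark",
      						 WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
      		return memcg_wmark_wq ? 0 : -ENOMEM;
      	}
      
      	/* Called from the charge path (illustrative). */
      	static void memcg_check_wmark(struct mem_cgroup *memcg)
      	{
      		if (page_counter_read(&memcg->memory) > memcg->wmark_high)
      			queue_work(memcg_wmark_wq, &memcg->wmark_work);
      	}
      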
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
  2. 28 November 2019: 2 commits
  3. 20 November 2019: 9 commits
  4. 07 November 2019: 3 commits
    • blkcg: add tools/cgroup/iocost_coef_gen.py · 2868b62d
      Authored by Tejun Heo
      commit 8504dea783b044cab620acbaef87b86ee84646fe upstream.
      
      Add a script which can be used to generate device-specific iocost
      linear model coefficients.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • blkcg: implement blk-iocost · abd942b2
      Authored by Tejun Heo
      commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream.
      
      This patchset implements an IO cost model based, work-conserving
      proportional controller.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest latency target cgroup which is suffering is protected at
      the cost of all others.  In many use cases including stacking multiple
      workload containers in a single system, it's necessary to distribute
      IO capacity with better granularity.
      
      One challenge of controlling IO resources is the lack of a trivially
      observable cost metric.  The most common metrics - bandwidth and iops
      - can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions on how
      expensive a given stream of IOs would be, at least compared to other
      IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such a model.  The more accurate the cost
      model, the better, but the controller adapts based on IO completion
      latency, and as long as the relative costs across different IO
      patterns are consistent and sensible, it will adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
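      
      As a rough illustration of the linear model idea, the cost of one IO can be
      thought of as a fixed seek/command term plus a per-page transfer term.  The
      structure and coefficient names below are made up for this example; the
      real parameters and scaling live in blk-iocost.c:
      
      	struct iocost_linear_coefs {
      		u64 seq_io_cost;	/* fixed cost of one sequential IO */
      		u64 rand_io_cost;	/* fixed cost of one random IO */
      		u64 page_cost;		/* cost per page transferred */
      	};
      
      	static u64 iocost_estimate(const struct iocost_linear_coefs *c,
      				   bool is_seq, u64 pages)
      	{
      		u64 cost = is_seq ? c->seq_io_cost : c->rand_io_cost;
      
      		return cost + c->page_cost * pages;	/* linear in IO size */
      	}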
      
      Please see the top comment in blk-iocost.c and documentation for
      more details.
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      [Joseph: fix conflicts with ioc_rqos_throttle()]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • cgroup: add cgroup_parse_float() · cb2f6e75
      Authored by Tejun Heo
      commit a5e112e6424adb77d953eac20e6936b952fd6b32 upstream.
      
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers.  While at it, also clarify the default time unit.
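      
      A hedged usage sketch, assuming the upstream helper signature
      int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v),
      where dec_shift selects how many decimal places to keep, as a controller's
      write handler might call it:
      
      	s64 ratio;
      
      	/* With dec_shift == 2, the string "12.34" parses to 1234. */
      	if (cgroup_parse_float("12.34", 2, &ratio))
      		return -EINVAL;
      	/* ratio now holds the percentage as a fixed-point value scaled by 100. */
      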
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
  5. 01 November 2019: 1 commit
    • sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 192fa322
      Authored by Dave Chiluk
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
      It has been observed that highly-threaded, non-cpu-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      periods throttled while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive non-cpu
      bound applications, such as those running in kubernetes or mesos when
      run on multiple cpu cores.
      
      This has been root caused to cpu-local run queues being allocated per-cpu
      bandwidth slices, and then not fully using those slices within the period,
      at which point the slice and quota expire.  This expiration of unused
      slices results in applications not being able to utilize the quota for
      which they are allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
              /* extend local deadline, drift is bounded above by 2 ticks */
              cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, and has recently been fixed
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
      that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. This allows threads on runqueues that do not
      use much CPU to continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. However, this helps prevent the
      above condition of hitting throttling while also not fully utilizing
      your cpu quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods.  This overflow would be bounded by the
      remaining quota left on each per-cpu runqueue.  This is typically no
      more than min_cfs_rq_runtime=1ms per cpu.  For CPU bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
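      
      To put a rough number on that bound, under the commit's own 80-CPU example
      and the default min_cfs_rq_runtime of 1ms (a back-of-the-envelope sketch,
      not code from the patch):
      
      	/* Rough aggregate upper bound on unexpired slices across all CPUs. */
      	unsigned int nr_cpus = 80;
      	u64 per_cpu_carryover_ns = 1 * NSEC_PER_MSEC;		/* min_cfs_rq_runtime */
      	u64 worst_case_ns = nr_cpus * per_cpu_carryover_ns;	/* roughly 80ms total */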
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines.  In the case of an artificial testcase (10ms/100ms of quota on
      an 80-CPU machine), this commit resulted in an almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
  6. 30 October 2019: 9 commits
  7. 18 October 2019: 1 commit
  8. 12 October 2019: 3 commits
  9. 21 September 2019: 1 commit
  10. 16 September 2019: 5 commits
  11. 29 August 2019: 1 commit
    • x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h · e063b03b
      Authored by Tom Lendacky
      commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24 upstream.
      
      There have been reports of RDRAND issues after resuming from suspend on
      some AMD family 15h and family 16h systems. This issue stems from a BIOS
      not performing the proper steps during resume to ensure RDRAND continues
      to function properly.
      
      RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be
      reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND
      support using CPUID, including the kernel, will believe that RDRAND is
      not supported.
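      
      A hedged sketch of the clearing step described above; the MSR address and
      bit position come from the text, while the macro and function names here
      are illustrative rather than the ones used in the actual patch:
      
      	#define MSR_AMD_CPUID_EXT_FEATURES	0xc0011004	/* MSR C001_1004 */
      
      	static void clear_rdrand_cpuid_bit_sketch(void)
      	{
      		u64 val;
      
      		/* Clearing bit 62 hides CPUID Fn00000001_ECX[30] (RDRAND). */
      		rdmsrl(MSR_AMD_CPUID_EXT_FEATURES, val);
      		val &= ~BIT_ULL(62);
      		wrmsrl(MSR_AMD_CPUID_EXT_FEATURES, val);
      	}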
      
      Update the CPU initialization to clear the RDRAND CPUID bit for any family
      15h and 16h processor that supports RDRAND. If it is known that the family
      15h or family 16h system does not have an RDRAND resume issue or that the
      system will not be placed in suspend, the "rdrand=force" kernel parameter
      can be used to stop the clearing of the RDRAND CPUID bit.
      
      Additionally, update the suspend and resume path to save and restore the
      MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in
      place after resuming from suspend.
      
      Note that clearing the RDRAND CPUID bit does not prevent a processor
      that normally supports the RDRAND instruction from executing it.  So any
      code that determined the support based on family and model won't #UD.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
      Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "x86@kernel.org" <x86@kernel.org>
      Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  12. 07 August 2019: 2 commits
    • Documentation: Add swapgs description to the Spectre v1 documentation · 7634b9cd
      Authored by Josh Poimboeuf
      commit 4c92057661a3412f547ede95715641d7ee16ddac upstream
      
      Add documentation to the Spectre document about the new swapgs variant of
      Spectre v1.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation: Enable Spectre v1 swapgs mitigations · 23e7a7b3
      Authored by Josh Poimboeuf
      commit a2059825986a1c8143fd6698774fa9d83733bb11 upstream
      
      The previous commit added macro calls in the entry code which mitigate the
      Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are
      enabled.  Enable those features where applicable.
      
      The mitigations may be disabled with "nospectre_v1" or "mitigations=off".
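      
      A rough sketch of what enabling those features can look like in the
      mitigation selection code; the exact conditions in the real patch depend on
      the CPU vendor, PTI state, and command line, and only the feature flag
      names are taken from the text above:
      
      	static void __init spectre_v1_swapgs_mitigation_sketch(void)
      	{
      		/* Fence after a swapgs taken on entry from user space. */
      		setup_force_cpu_cap(X86_FEATURE_FENCE_SWAPGS_USER);
      
      		/*
      		 * Fence on kernel-entry paths too, for CPUs where a swapgs
      		 * coming from kernel space can be speculatively executed
      		 * (see the Intel note below).
      		 */
      		setup_force_cpu_cap(X86_FEATURE_FENCE_SWAPGS_KERNEL);
      	}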
      
      There are different features which can affect the risk of attack:
      
      - When FSGSBASE is enabled, unprivileged users are able to place any
        value in GS, using the wrgsbase instruction.  This means they can
        write a GS value which points to any value in kernel space, which can
        be useful with the following gadget in an interrupt/exception/NMI
        handler:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	// dependent load or store based on the value of %reg1
      	// for example: mov (%reg1), %reg2
      
        If an interrupt is coming from user space, and the entry code
        speculatively skips the swapgs (due to user branch mistraining), it
        may speculatively execute the GS-based load and a subsequent dependent
        load or store, exposing the kernel data to an L1 side channel leak.
      
        Note that, on Intel, a similar attack exists in the above gadget when
        coming from kernel space, if the swapgs gets speculatively executed to
        switch back to the user GS.  On AMD, this variant isn't possible
        because swapgs is serializing with respect to future GS-based
        accesses.
      
        NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case
      	doesn't exist quite yet.
      
      - When FSGSBASE is disabled, the issue is mitigated somewhat because
        unprivileged users must use prctl(ARCH_SET_GS) to set GS, which
        restricts GS values to user space addresses only.  That means the
        gadget would need an additional step, since the target kernel address
        needs to be read from user space first.  Something like:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	mov (%reg1), %reg2
      	// dependent load or store based on the value of %reg2
      	// for example: mov (%reg2), %reg3
      
        It's difficult to audit for this gadget in all the handlers, so while
        there are no known instances of it, it's entirely possible that it
        exists somewhere (or could be introduced in the future).  Without
        tooling to analyze all such code paths, consider it vulnerable.
      
        Effects of SMAP on the !FSGSBASE case:
      
        - If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not
          susceptible to Meltdown), the kernel is prevented from speculatively
          reading user space memory, even L1 cached values.  This effectively
          disables the !FSGSBASE attack vector.
      
        - If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP
          still prevents the kernel from speculatively reading user space
          memory.  But it does *not* prevent the kernel from reading the
          user value from L1, if it has already been cached.  This is probably
          only a small hurdle for an attacker to overcome.
      
      Thanks to Dave Hansen for contributing the speculative_smap() function.
      
      Thanks to Andrew Cooper for providing the inside scoop on whether swapgs
      is serializing on AMD.
      
      [ tglx: Fixed the USER fence decision and polished the comment as suggested
        	by Dave Hansen ]
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  13. 26 July 2019: 2 commits
    • dt-bindings: allow up to four clocks for orion-mdio · 404f59e2
      Authored by Josua Mayer
      commit 80785f5a22e9073e2ded5958feb7f220e066d17b upstream.
      
      Armada 8040 needs four clocks to be enabled for MDIO accesses to work.
      Update the binding to allow the extra clock to be specified.
      
      Cc: stable@vger.kernel.org
      Fixes: 6d6a331f ("dt-bindings: allow up to three clocks for orion-mdio")
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Josua Mayer <josua@solid-run.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/atomic: Fix smp_mb__{before,after}_atomic() · 9e0bcb59
      Authored by Peter Zijlstra
      [ Upstream commit 69d927bba39517d0980462efc051875b7f4db185 ]
      
      Recent probing at the Linux Kernel Memory Model uncovered a
      'surprise'. Strongly ordered architectures where the atomic RmW
      primitive implies full memory ordering and
      smp_mb__{before,after}_atomic() are a simple barrier() (such as x86)
      fail for:
      
      	*x = 1;
      	atomic_inc(u);
      	smp_mb__after_atomic();
      	r0 = *y;
      
      Because, while the atomic_inc() implies memory order, it
      (surprisingly) does not provide a compiler barrier. This then allows
      the compiler to re-order like so:
      
      	atomic_inc(u);
      	*x = 1;
      	smp_mb__after_atomic();
      	r0 = *y;
      
      Which the CPU is then allowed to re-order (under TSO rules) like:
      
      	atomic_inc(u);
      	r0 = *y;
      	*x = 1;
      
      And this very much was not intended. Therefore strengthen the atomic
      RmW ops to include a compiler barrier.
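      
      As a hedged illustration of what "include a compiler barrier" means on x86,
      one way to do this is to add a "memory" clobber to the LOCK-prefixed asm;
      a sketch of the idea, not the exact upstream diff:
      
      	/*
      	 * x86 arch_atomic_inc() after the change: the "memory" clobber makes
      	 * the LOCK-prefixed asm act as a compiler barrier as well; before the
      	 * change the clobber list was empty, so only CPU ordering was implied.
      	 */
      	static __always_inline void arch_atomic_inc(atomic_t *v)
      	{
      		asm volatile(LOCK_PREFIX "incl %0"
      			     : "+m" (v->counter) : : "memory");
      	}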
      
      NOTE: atomic_{or,and,xor} and the bitops already had the compiler
      barrier.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>