1. 25 Dec, 2019 1 commit
  2. 28 Nov, 2019 1 commit
  3. 20 Nov, 2019 11 commits
    • B
      resource/docs: Complete kernel-doc style function documentation · 1de9c7c3
      Borislav Petkov authored
      commit f26621e60b35369bca9228bc936dc723b3e421af upstream.
      
      Add the missing kernel-doc style function parameters documentation.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: linux-tip-commits@vger.kernel.org
      Cc: rdunlap@infradead.org
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
    • R
      resource/docs: Fix new kernel-doc warnings · 39cecf2f
      Randy Dunlap authored
      commit f75d651587f719a813ebbbfeee570e6570731d55 upstream.
      
      The first group of warnings is caused by a "/**" kernel-doc notation
      marker but the function comments are not in kernel-doc format.
      Also add another error return value here.
      
        ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
        ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'
      
      Add the missing function parameter documentation for the other warnings:
      
        ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
        ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
      Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [joseph: fix find_next_iomem_res() documentation]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
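For reference, a minimal sketch of the kernel-doc comment layout that the two resource/docs commits above deal with: a summary line of the form `name() - description`, one `@param:` line per parameter, and a `Return:` section. The function `demo_ranges_overlap()` and its parameters are hypothetical and are not part of kernel/resource.c.

```c
#include <stdbool.h>

/**
 * demo_ranges_overlap() - Check whether two inclusive address ranges overlap.
 * @a_start: first address of range A
 * @a_end: last address of range A (inclusive)
 * @b_start: first address of range B
 * @b_end: last address of range B (inclusive)
 *
 * Hypothetical helper used only to illustrate the comment format. Leaving
 * out any of the @param lines above is what makes the kernel-doc tool emit
 * a "Function parameter or member ... not described" warning.
 *
 * Return: true if the ranges share at least one address, false otherwise.
 */
bool demo_ranges_overlap(unsigned long a_start, unsigned long a_end,
			 unsigned long b_start, unsigned long b_end)
{
	return a_start <= b_end && b_start <= a_end;
}
```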
    • D
      mm/resource: Let walk_system_ram_range() search child resources · 3ed62604
      Dave Hansen authored
      commit 2b539aefe9e48e3908cff02699aa63a8b9bd268e upstream
      
      In the process of onlining memory, we use walk_system_ram_range()
      to find the actual RAM areas inside of the area being onlined.
      
      However, it currently only finds memory resources which are
      "top-level" iomem_resources.  Children are not currently
      searched which causes it to skip System RAM in areas like this
      (in the format of /proc/iomem):
      
      a0000000-bfffffff : Persistent Memory (legacy)
        a0000000-afffffff : System RAM
      
      Changing the true->false here allows children to be searched
      as well.  We need this because we add a new "System RAM"
      resource underneath the "persistent memory" resource when
      we use persistent memory in a volatile mode.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
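The effect of searching child resources can be illustrated with a small userspace model. This is only a sketch: `struct res_node`, `count_system_ram()` and `demo_count()` are hypothetical names, and the recursion here stands in for the kernel's own tree traversal, which is implemented differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified userspace stand-in for the kernel's struct resource. */
struct res_node {
	const char *name;
	unsigned long start, end;
	struct res_node *child;   /* first child */
	struct res_node *sibling; /* next node on the same level */
};

/*
 * Count nodes named "System RAM". With first_lvl set, only the
 * top-level siblings are inspected; otherwise children are visited too.
 */
int count_system_ram(const struct res_node *n, bool first_lvl)
{
	int found = 0;

	for (; n; n = n->sibling) {
		if (strcmp(n->name, "System RAM") == 0)
			found++;
		if (!first_lvl && n->child)
			found += count_system_ram(n->child, false);
	}
	return found;
}

/* Builds the /proc/iomem layout quoted in the commit message. */
int demo_count(bool first_lvl)
{
	struct res_node ram = {
		"System RAM", 0xa0000000UL, 0xafffffffUL, NULL, NULL
	};
	struct res_node pmem = {
		"Persistent Memory (legacy)", 0xa0000000UL, 0xbfffffffUL,
		&ram, NULL
	};

	return count_system_ram(&pmem, first_lvl);
}
```

In this model, a first-level-only walk misses the nested "System RAM" entry, while a walk that descends into children finds it.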
    • D
      mm/resource: Move HMM pr_debug() deeper into resource code · cdb8d31f
      Dave Hansen authored
      commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream
      
      HMM consumes physical address space for its own use, even
      though nothing is mapped or accessible there.  It uses a
      special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
      to uniquely identify these areas.
      
      When HMM consumes address space, it makes a best guess about
      what to consume.  However, it is possible that a future memory
      or device hotplug can collide with the reserved area.  In the
      case of these conflicts, there is an error message in
      register_memory_resource().
      
      Later patches in this series move register_memory_resource()
      from using request_resource_conflict() to __request_region().
      Unfortunately, __request_region() does not return the conflict
      like the previous function did, which makes it impossible to
      check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
      resource.
      
      Instead of warning in register_memory_resource(), move the
      check into the core resource code itself (__request_region())
      where the conflicting resource _is_ available.  This has the
      added bonus of producing a warning in case of HMM conflicts
      with devices *or* RAM address space, as opposed to the RAM-
      only warnings that were there previously.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • D
      mm/resource: Return real error codes from walk failures · 2c33206f
      Dave Hansen authored
      commit 5cd401ace914dc68556c6d2fcae0c349444d5f86 upstream
      
      walk_system_ram_range() can return an error code either because
      *it* failed, or because the 'func' that it calls returned an
      error.  The memory hotplug code does the following:
      
      	ret = walk_system_ram_range(..., func);
      	if (ret)
      		return ret;
      
      and 'ret' makes it out to userspace, eventually.  The problem
      is, walk_system_ram_range() failures that result from *it* failing
      (as opposed to 'func') return -1.  That leads to a very odd
      -EPERM (-1) return code out to userspace.
      
      Make walk_system_ram_range() return -EINVAL for internal
      failures to keep userspace less confused.
      
      This return code is compatible with all the callers that I
      audited.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
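The distinction between the walker's own failure and a callback failure can be sketched in userspace as follows. This is a simplified model, not the kernel implementation: `walk_ranges()`, `struct ram_range`, `always_ok()` and `always_busy()` are hypothetical names introduced for illustration.

```c
#include <errno.h>
#include <stddef.h>

typedef int (*walk_fn)(unsigned long start_pfn, unsigned long nr_pages,
		       void *arg);

struct ram_range {
	unsigned long start_pfn, nr_pages;
};

/*
 * Model of a range walker: returns -EINVAL when no range was visited
 * (an internal failure), otherwise the last callback return value.
 * A non-zero callback result stops the walk and is passed through.
 */
int walk_ranges(const struct ram_range *r, int n, void *arg, walk_fn func)
{
	int ret = -EINVAL; /* a bare -1 here would read as -EPERM */
	int i;

	for (i = 0; i < n; i++) {
		ret = func(r[i].start_pfn, r[i].nr_pages, arg);
		if (ret)
			break;
	}
	return ret;
}

/* Callback that always succeeds. */
int always_ok(unsigned long start_pfn, unsigned long nr_pages, void *arg)
{
	(void)start_pfn; (void)nr_pages; (void)arg;
	return 0;
}

/* Callback that always reports its own error. */
int always_busy(unsigned long start_pfn, unsigned long nr_pages, void *arg)
{
	(void)start_pfn; (void)nr_pages; (void)arg;
	return -EBUSY;
}
```

A caller that simply propagates the return value then hands userspace either the callback's error or -EINVAL, never an accidental -EPERM.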
    • O
      kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable · d0bc6e68
      Oscar Salvador authored
      commit 65c78784135f847e49eb98e6b976e453e71100c3 upstream
      
      This is a preparation for the next patch.
      
      Currently, we only call release_mem_region_adjustable() in __remove_pages
      if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
      are being released by themselves with devm_release_mem_region.
      
      Since we do not want to touch any zone/page stuff during the removing of
      the memory (but during the offlining), we do not want to check for the
      zone here.  So we need another way to tell release_mem_region_adjustable()
      not to release the resource in case it belongs to HMM/devm.
      
      HMM/devm acquires/releases a resource through
      devm_request_mem_region/devm_release_mem_region.
      
      These resources have the flag IORESOURCE_MEM, while resources acquired by
      hot-add memory path (register_memory_resource()) contain
      IORESOURCE_SYSTEM_RAM.
      
      So, we can check for this flag in release_mem_region_adjustable, and if
      the resource does not contain such flag, we know that we are dealing with
      a HMM/devm resource, so we can back off.
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
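The flag test described in this commit message can be sketched as below. The `DEMO_` constants are illustrative stand-ins chosen for this example (the kernel defines its own values in its headers), and `may_release_adjustable()` is a hypothetical helper, not the kernel function.

```c
#include <stdbool.h>

/* Illustrative flag values; assumptions made for this sketch only. */
#define DEMO_IORESOURCE_MEM	0x00000200UL
#define DEMO_IORESOURCE_SYSRAM	0x01000000UL
#define DEMO_IORESOURCE_SYSTEM_RAM \
	(DEMO_IORESOURCE_MEM | DEMO_IORESOURCE_SYSRAM)

/*
 * Only resources registered by the memory hot-add path carry the
 * SYSRAM bit. A plain MEM resource is assumed to belong to HMM/devm,
 * so the caller should back off and leave it alone.
 */
bool may_release_adjustable(unsigned long flags)
{
	return (flags & DEMO_IORESOURCE_SYSRAM) != 0;
}
```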
    • B
      resource: Clean it up a bit · 02cc5207
      Borislav Petkov authored
      commit b69c2e20f6e4046da84ce5b33ba1ef89cb087b40 upstream
      
      - Drop BUG_ON()s and do normal error handling instead, in
        find_next_iomem_res().
      
      - Align function arguments on opening braces.
      
      - Get rid of local var sibling_only in find_next_iomem_res().
      
      - Shorten unnecessarily long first_level_children_only arg name.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Bjorn Helgaas <bhelgaas@google.com>
      CC: Brijesh Singh <brijesh.singh@amd.com>
      CC: Dan Williams <dan.j.williams@intel.com>
      CC: H. Peter Anvin <hpa@zytor.com>
      CC: Lianbo Jiang <lijiang@redhat.com>
      CC: Takashi Iwai <tiwai@suse.de>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Tom Lendacky <thomas.lendacky@amd.com>
      CC: Vivek Goyal <vgoyal@redhat.com>
      CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      CC: bhe@redhat.com
      CC: dan.j.williams@intel.com
      CC: dyoung@redhat.com
      CC: kexec@lists.infradead.org
      CC: mingo@redhat.com
      Link: <new submission>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
    • K
      ICX: perf/x86: Disable extended registers for non-supported PMUs · a6b5314d
      Kan Liang authored
      commit e321d02db87af7840da29ef833a2a71fc0eab198 upstream.
      
      The perf fuzzer caused a Skylake machine to crash:
      
      [ 9680.085831] Call Trace:
      [ 9680.088301]  <IRQ>
      [ 9680.090363]  perf_output_sample_regs+0x43/0xa0
      [ 9680.094928]  perf_output_sample+0x3aa/0x7a0
      [ 9680.099181]  perf_event_output_forward+0x53/0x80
      [ 9680.103917]  __perf_event_overflow+0x52/0xf0
      [ 9680.108266]  ? perf_trace_run_bpf_submit+0xc0/0xc0
      [ 9680.113108]  perf_swevent_hrtimer+0xe2/0x150
      [ 9680.117475]  ? check_preempt_wakeup+0x181/0x230
      [ 9680.122091]  ? check_preempt_curr+0x62/0x90
      [ 9680.126361]  ? ttwu_do_wakeup+0x19/0x140
      [ 9680.130355]  ? try_to_wake_up+0x54/0x460
      [ 9680.134366]  ? reweight_entity+0x15b/0x1a0
      [ 9680.138559]  ? __queue_work+0x103/0x3f0
      [ 9680.142472]  ? update_dl_rq_load_avg+0x1cd/0x270
      [ 9680.147194]  ? timerqueue_del+0x1e/0x40
      [ 9680.151092]  ? __remove_hrtimer+0x35/0x70
      [ 9680.155191]  __hrtimer_run_queues+0x100/0x280
      [ 9680.159658]  hrtimer_interrupt+0x100/0x220
      [ 9680.163835]  smp_apic_timer_interrupt+0x6a/0x140
      [ 9680.168555]  apic_timer_interrupt+0xf/0x20
      [ 9680.172756]  </IRQ>
      
      The XMM registers can only be collected by PEBS hardware events on the
      platforms with PEBS baseline support, e.g. Icelake, not software/probe
      events.
      
      Add a capability flag, PERF_PMU_CAP_EXTENDED_REGS, to indicate PMUs
      that support extended registers. For X86, the extended registers are
      XMM registers.
      
      Add has_extended_regs() to check if extended registers are applied.
      
      The generic code defines the mask of extended registers as 0 if arch
      headers haven't overridden it.
      Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 878068ea270e ("perf/x86: Support outputting XMM registers")
      Link: https://lkml.kernel.org/r/1559081314-9714-1-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Shen, Xiaochen <xiaochen.shen@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
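A rough userspace model of the capability check described above. Everything prefixed `demo_` or `DEMO_` is hypothetical: the flag value, the choice of register bits 32..63 as the "extended" range, the struct layouts and the -EOPNOTSUPP return code are assumptions for illustration, not the kernel's definitions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CAP_EXTENDED_REGS	0x100u /* hypothetical capability bit */
/* Assumption: register bits 32..63 are the "extended" ones. */
#define DEMO_REG_EXTENDED_MASK	(~((1ULL << 32) - 1))

struct demo_pmu {
	unsigned int capabilities;
};

struct demo_attr {
	uint64_t sample_regs_user;
	uint64_t sample_regs_intr;
};

/* True if the event asks for any register in the extended range. */
bool demo_has_extended_regs(const struct demo_attr *a)
{
	return (a->sample_regs_user & DEMO_REG_EXTENDED_MASK) ||
	       (a->sample_regs_intr & DEMO_REG_EXTENDED_MASK);
}

/* Reject extended-register requests on PMUs lacking the capability. */
int demo_event_init(const struct demo_pmu *pmu, const struct demo_attr *a)
{
	if (demo_has_extended_regs(a) &&
	    !(pmu->capabilities & DEMO_CAP_EXTENDED_REGS))
		return -EOPNOTSUPP;
	return 0;
}
```

With an extended mask of 0, as the generic code is described to define it, `demo_has_extended_regs()` would never return true and the check would be a no-op.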
    • A
      ICX: perf/core: Add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs · fd8d0f3e
      Andrew Murray authored
      commit cc6795aeffea0a80d0baf9ad31ba926a6c42cef5 upstream.
      
      Many PMU drivers do not have the capability to exclude counting events
      that occur in specific contexts such as idle, kernel, guest, etc. These
      drivers indicate this by returning an error in their event_init upon
      testing the event's attribute flags. This approach is error prone and
      often inconsistent.
      
      Let's instead allow PMU drivers to advertise their inability to exclude
      based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This
      allows the perf core to reject requests for exclusion events where
      there is no support in the PMU.
      Signed-off-by: Andrew Murray <andrew.murray@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sascha Hauer <s.hauer@pengutronix.de>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: robin.murphy@arm.com
      Cc: suzuki.poulose@arm.com
      Link: https://lkml.kernel.org/r/1547128414-50693-4-git-send-email-andrew.murray@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Shen, Xiaochen <xiaochen.shen@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
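The idea of a centrally enforced "cannot exclude" capability can be modelled as below. This is a sketch under assumptions: the `DEMO_PMU_CAP_NO_EXCLUDE` value, `struct demo_excl_attr`, both function names and the -EINVAL return code are hypothetical and chosen only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_PMU_CAP_NO_EXCLUDE	0x80u /* hypothetical value */

/* Subset of exclusion bits an event may request. */
struct demo_excl_attr {
	unsigned int exclude_user   : 1;
	unsigned int exclude_kernel : 1;
	unsigned int exclude_hv     : 1;
	unsigned int exclude_idle   : 1;
	unsigned int exclude_host   : 1;
	unsigned int exclude_guest  : 1;
};

/* True if the event asks for any context to be excluded. */
bool demo_has_any_exclude(const struct demo_excl_attr *a)
{
	return a->exclude_user || a->exclude_kernel || a->exclude_hv ||
	       a->exclude_idle || a->exclude_host || a->exclude_guest;
}

/*
 * Core-side check: a PMU that advertises NO_EXCLUDE cannot honor
 * exclusion requests, so reject them once here instead of in every
 * driver's event_init.
 */
int demo_core_event_check(unsigned int pmu_caps,
			  const struct demo_excl_attr *a)
{
	if ((pmu_caps & DEMO_PMU_CAP_NO_EXCLUDE) && demo_has_any_exclude(a))
		return -EINVAL;
	return 0;
}
```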
    • A
      ICX: perf/ring_buffer: Fix AUX software double buffering · b4394fb2
      Alexander Shishkin authored
      commit 26ae4f4406f88d82d79c85c11ac5fae18213cd38 upstream.
      
      This recent commit:
      
        5768402fd9c6e87 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically")
      
      overlooked the fact that the previous one page granularity of the AUX buffer
      provided an implicit double buffering capability to the PMU driver, which
      went away when the entire buffer became one high-order page.
      
      Always make the full-trace mode AUX allocation at least two-part to preserve
      the previous behavior and allow the implicit double buffering to continue.
      Reported-by: Ammy Yi <ammy.yi@intel.com>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: adrian.hunter@intel.com
      Fixes: 5768402fd9c6e87 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically")
      Link: http://lkml.kernel.org/r/20190503085536.24119-2-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Shen, Xiaochen <xiaochen.shen@intel.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
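One way to picture the "at least two-part" rule is to cap the allocation order so that a non-overwrite (full-trace) buffer is always split into two or more chunks. The sketch below is a simplified model under that assumption. `demo_ilog2()`, `demo_max_order()` and `demo_nr_chunks()` are hypothetical helpers, and the upstream change may derive the order differently.

```c
#include <stdbool.h>

/* floor(log2(n)) for n >= 1. */
unsigned int demo_ilog2(unsigned long n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/*
 * Largest allocation order to use for an AUX buffer of nr_pages pages.
 * In full-trace (non-overwrite) mode the order is lowered by one so the
 * buffer consists of at least two chunks, which lets a driver fill one
 * chunk while the other is being consumed.
 */
unsigned int demo_max_order(unsigned long nr_pages, bool overwrite)
{
	unsigned int order = demo_ilog2(nr_pages);

	if (!overwrite && order > 0)
		order--;
	return order;
}

/* Number of chunks of the chosen order that make up the buffer. */
unsigned long demo_nr_chunks(unsigned long nr_pages, bool overwrite)
{
	return nr_pages >> demo_max_order(nr_pages, overwrite);
}
```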
    • J
      x86/mm: Split vmalloc_sync_all() · 73c092d2
      Joerg Roedel authored
      commit 1a0a610d5f056c6195ae9808962477a94d1d72c8 upstream.
      
      Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
      __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the
      vunmap() code-path.  While this change was necessary to maintain
      correctness on x86-32-pae kernels, it also adds additional cycles for
      architectures that don't need it.
      
      Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
      severe performance regressions in micro-benchmarks because it now also
      calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
      the vmalloc_sync_all() implementation on x86-64 is only needed for newly
      created mappings.
      
      To avoid the unnecessary work on x86-64 and to gain the performance back,
      split up vmalloc_sync_all() into two functions:
      
      	* vmalloc_sync_mappings(), and
      	* vmalloc_sync_unmappings()
      
      Most call-sites to vmalloc_sync_all() only care about new mappings being
      synchronized.  The only exception is the new call-site added in the above
      mentioned commit.
      
      Shile Zhang directed us to a report of an 80% regression in reaim
      throughput.
      
      Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
      Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
      Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
      Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
  4. 07 Nov, 2019 2 commits
  5. 01 Nov, 2019 3 commits
    • Q
      sched/fair: Fix -Wunused-but-set-variable warnings · 793ddb52
      Qian Cai authored
      commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 upstream.
      
      Commit:
      
         de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
      
      introduced a few compilation warnings:
      
        kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
        kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
        kernel/sched/fair.c: In function 'start_cfs_bandwidth':
        kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]
      
      Also, __refill_cfs_bandwidth_runtime() no longer updates the
      expiration time, so fix the comments accordingly.
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pauld@redhat.com
      Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
      Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • D
      sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 192fa322
      Dave Chiluk authored
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
      It has been observed that highly-threaded, non-cpu-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      periods throttled while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive non-cpu
      bound applications, such as those running in kubernetes or mesos when
      run on multiple cpu cores.
      
      This has been root-caused to cpu-local run queues being allocated per-cpu
      bandwidth slices and then not fully using those slices within the period,
      at which point the slice and quota expire. This expiration of unused
      slices results in applications not being able to utilize the quota
      they are allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
              /* extend local deadline, drift is bounded above by 2 ticks */
              cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, and has recently been fixed
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
      that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. This allows threads on runqueues that do not
      use much CPU to continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. This helps prevent the
      above condition of hitting throttling while also not fully utilizing
      your cpu quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods. This overflow would be bounded by the
      remaining quota left on each per-cpu runqueue. This is typically no
      more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines. In the case of an artificial testcase (10ms/100ms of quota on
      80 CPU machine), this commit resulted in almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
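The overflow bound mentioned in this commit message can be worked through with a little arithmetic. The sketch assumes, as the message states, that each per-cpu runqueue typically retains no more than min_cfs_rq_runtime = 1ms. `demo_max_carryover_us()` and the `DEMO_` constant are hypothetical names used only for this calculation.

```c
/* Assumed from the commit text: at most 1ms of leftover slice per CPU. */
#define DEMO_MIN_CFS_RQ_RUNTIME_US	1000L

/*
 * Typical upper bound, in microseconds, on runtime that can carry over
 * a period boundary when every one of nr_cpus runqueues still holds
 * its full leftover slice.
 */
long demo_max_carryover_us(int nr_cpus)
{
	return (long)nr_cpus * DEMO_MIN_CFS_RQ_RUNTIME_US;
}
```

For the 80-CPU test machine mentioned above, this gives a carry-over of at most 80ms in total under the stated assumption, spread across the runqueues that did not use up their slices.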
    • B
      sched/fair: Don't push cfs_bandwith slack timers forward · 9a99f90a
      bsegall@google.com authored
      commit 66567fcbaecac455caa1b13643155d686b51ce63 upstream.
      
      When a cfs_rq sleeps and returns its quota, we delay for 5ms before
      waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
      sleep, as this has to be done outside of the rq lock we hold.
      
      The current code waits for 5ms without any sleeps, instead of waiting
      for 5ms from the first sleep, which can delay the unthrottle more than
      we want. Switch this around so that we can't push this forward forever.
      
      This requires an extra flag rather than using hrtimer_active, since we
      need to start a new timer if the current one is in the process of
      finishing.
      Signed-off-by: Ben Segall <bsegall@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Acked-by: Phil Auld <pauld@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/xm26a7euy6iq.fsf_-_@bsegall-linux.svl.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
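The "extra flag" approach described in this commit message can be modelled in userspace as follows. This is a sketch, not the scheduler code: `struct demo_bw`, `demo_start_slack()` and `demo_slack_fired()` are hypothetical names, and a plain deadline value stands in for the hrtimer.

```c
#include <stdbool.h>

#define DEMO_SLACK_DELAY_NS	5000000LL /* 5ms, as in the commit text */

struct demo_bw {
	bool slack_started;		/* a slack timer is already pending */
	long long slack_deadline_ns;	/* when the pending timer fires */
};

/*
 * Called whenever a runqueue goes to sleep and returns quota. Only the
 * first call arms the timer; later calls leave the deadline alone, so
 * repeated sleeps cannot keep pushing the unthrottle further out.
 */
void demo_start_slack(struct demo_bw *b, long long now_ns)
{
	if (b->slack_started)
		return;
	b->slack_started = true;
	b->slack_deadline_ns = now_ns + DEMO_SLACK_DELAY_NS;
}

/* Timer callback: clear the flag so the next sleep can arm a new timer. */
void demo_slack_fired(struct demo_bw *b)
{
	b->slack_started = false;
}
```

Using a dedicated flag rather than asking whether the timer is active means a new timer can be armed even while the previous one is still finishing.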
  6. 30 Oct, 2019 22 commits