1. 25 Aug, 2020 1 commit
    • rcu: Provide optional RCU-reader exit delay for strict GPs · 3d29aaf1
      Authored by Paul E. McKenney
      The goal of this series is to increase the probability of tools like
      KASAN detecting that an RCU-protected pointer was used outside of its
      RCU read-side critical section.  Thus far, the approach has been to make
      grace periods and callback processing happen faster.  Another approach
      is to delay the pointer leaker.  This commit therefore allows a delay
      to be applied to exit from RCU read-side critical sections.
      
      This delay is specified in microseconds by a new rcutree.rcu_unlock_delay
      kernel boot parameter, which defaults to zero.
      
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  2. 14 Aug, 2020 4 commits
  3. 13 Aug, 2020 10 commits
    • mfd: Replace HTTP links with HTTPS ones · 4f4ed454
      Authored by Alexander A. Klimov
      Rationale:
      Reduces attack surface on kernel devs opening the links for MITM
      as HTTPS traffic is much harder to manipulate.
      
      Deterministic algorithm:
      For each file:
        If not .svg:
          For each line:
            If doesn't contain `\bxmlns\b`:
              For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
                If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
                  If both the HTTP and HTTPS versions
                  return 200 OK and serve the same content:
                    Replace HTTP with HTTPS.
      Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
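The "deterministic algorithm" in the HTTPS-links commit message (4f4ed454) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the script the author actually ran: the two regular expressions are copied from the commit message, while `rewrite_line`, `rewrite_file`, and the `same_content_over_https` check are hypothetical names. The network comparison ("both versions return 200 OK and serve the same content") is stubbed out so the sketch runs offline.

```python
import re

# Patterns quoted from the commit message.
XMLNS = re.compile(r"\bxmlns\b")
LINK = re.compile(r"\bhttp://[^# \t\r\n]*(?:\w|/)")
EXEMPT = (
    re.compile(r"\bgnu\.org/license"),
    re.compile(r"\bmozilla\.org/MPL\b"),
)


def same_content_over_https(url):
    """Hypothetical stand-in for the network check in the commit message.

    A real check would fetch the HTTP and HTTPS versions and compare
    status codes and bodies; this stub simply assumes they match.
    """
    return True


def rewrite_line(line, check=same_content_over_https):
    """Upgrade eligible http:// links on one line to https://."""
    if XMLNS.search(line):
        return line  # lines containing xmlns are left alone

    def upgrade(match):
        url = match.group(0)
        if any(pattern.search(url) for pattern in EXEMPT):
            return url  # license URLs are never rewritten
        if not check(url):
            return url  # HTTP and HTTPS versions differ
        return "https://" + url[len("http://"):]

    return LINK.sub(upgrade, line)


def rewrite_file(name, text, check=same_content_over_https):
    """Apply rewrite_line to every line of a file, skipping .svg files."""
    if name.endswith(".svg"):
        return text
    lines = text.splitlines(keepends=True)
    return "".join(rewrite_line(line, check) for line in lines)
```

Passing a different `check` callable (for example one that always returns `False`) shows the conservative path, where a link is left untouched unless the HTTPS version is confirmed to be equivalent.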
    • dt-bindings: mfd: st,stmfx: Remove I2C unit name · a3f673d0
      Authored by Fabio Estevam
      Remove the I2C unit name to fix the following build warning with
      'make dt_binding_check':
      
      Warning (unit_address_vs_reg): /example-0/i2c@0: node has a unit name, but no reg or ranges property
      Signed-off-by: Fabio Estevam <festevam@gmail.com>
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
    • dt-bindings: mfd: ti,j721e-system-controller.yaml: Add J721e system controller · e9faaf05
      Authored by Roger Quadros
      Add DT binding schema for J721e system controller.
      Signed-off-by: Roger Quadros <rogerq@ti.com>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
    • mfd: smsc-ece1099: Remove driver · 7d2594cd
      Authored by Michael Walle
      This MFD driver has no user. The keypad driver of this device never made
      it into the kernel. Therefore, this driver is useless. Remove it.
      Signed-off-by: Michael Walle <michael@walle.cc>
      Cc: Sourav Poddar <sourav.poddar@ti.com>
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
    • coredump: add %f for executable filename · f38c85f1
      Authored by Lepton Wu
      The documentation says "%e" is the "executable filename", but the value
      can actually be changed by things like prctl PR_SET_NAME.  People who use
      "%e" in core_pattern are surprised to find that they get the thread name
      instead of the executable filename.

      This is either a documentation bug or a code bug.  Since "%e" has
      behaved this way for a long time, "fixing" the code could surprise users
      in a different way.

      So we just "fix" the documentation.  In addition, for users who really
      need the executable filename in core_pattern, we introduce a new "%f"
      for the real executable filename.  The kernel already has "%E" for the
      executable path, so the newly added "%f" format reuses most of its code.
      Signed-off-by: Lepton Wu <ytht.net@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200701031432.2978761-1-ytht.net@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
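The difference between "%e", "%f" and "%E" described in the coredump commit (f38c85f1) can be illustrated with a small Python model of core_pattern expansion. This is a simplified sketch, not the kernel's implementation: `expand_core_pattern` is a hypothetical helper, only four specifiers are modelled, escaping of the thread name is ignored, and the 15-character limit on the thread name is an assumption based on the kernel's usual comm buffer size.

```python
import posixpath

# Assumed size of the kernel's comm buffer, including the trailing NUL.
TASK_COMM_LEN = 16


def expand_core_pattern(pattern, comm, exe_path):
    """Expand a small subset of core_pattern specifiers.

    comm     -- the thread name, which prctl(PR_SET_NAME) can change
    exe_path -- absolute path of the executable on disk
    """
    substitutions = {
        "%": "%",
        "e": comm[:TASK_COMM_LEN - 1],         # thread name, not the file
        "f": posixpath.basename(exe_path),     # real executable filename
        "E": exe_path.replace("/", "!"),       # path with '/' shown as '!'
    }
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "%":
            out.append(ch)
            i += 1
        elif i + 1 < len(pattern):
            # Specifiers this sketch does not model expand to nothing.
            out.append(substitutions.get(pattern[i + 1], ""))
            i += 2
        else:
            i += 1  # a lone trailing '%' is dropped
    return "".join(out)
```

With a process started from `/usr/bin/myserver` whose thread was renamed to `worker-7`, the pattern `core.%e` yields `core.worker-7`, while `core.%f` yields `core.myserver`, which is the surprise the commit message describes.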
    • mm/vmstat: add events for THP migration without split · 1a5bae25
      Authored by Anshuman Khandual
      Add the following new vmstat events, which will help in validating THP
      migration without split.  Statistics reported through these new VM events
      will help in performance debugging.
      
      1. THP_MIGRATION_SUCCESS
      2. THP_MIGRATION_FAILURE
      3. THP_MIGRATION_SPLIT
      
      In addition, these new events also update normal page migration statistics
      appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE.  While here,
      this updates current trace event 'mm_migrate_pages' to accommodate now
      available THP statistics.
      
      [akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
      [ziy@nvidia.com: v2]
        Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
      [anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
        Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
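For the THP migration events commit (1a5bae25), a small Python sketch shows one way such counters might be read for performance debugging. It assumes the events appear in `/proc/vmstat` as lowercase `name value` lines sharing a `thp_migration` prefix; the exact counter names are not given in the commit message, so the helper matches on the prefix only. `thp_migration_counters` and `read_vmstat` are hypothetical names.

```python
def thp_migration_counters(vmstat_text, prefix="thp_migration"):
    """Return {name: value} for vmstat lines whose name starts with prefix.

    vmstat_text is expected to hold one "name value" pair per line;
    lines in any other shape are skipped.
    """
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        if name.startswith(prefix):
            counters[name] = int(value)
    return counters


def read_vmstat(path="/proc/vmstat"):
    """Read the raw vmstat text (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a Linux system, `thp_migration_counters(read_vmstat())` returns whatever matching counters the running kernel exposes, or an empty dictionary if there are none.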
    • doc, mm: clarify /proc/<pid>/oom_score value range · b1aa7c93
      Authored by Michal Hocko
      The exported value includes oom_score_adj, so the range is not [0, 1000] as
      described in the previous section but rather [0, 2000].  Mention that fact
      explicitly.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: http://lkml.kernel.org/r/20200709062603.18480-2-mhocko@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • doc, mm: sync up oom_score_adj documentation · de3f32e1
      Authored by Michal Hocko
      There are at least two notes in the oom section.  The 3% discount for root
      processes is gone since d46078b2 ("mm, oom: remove 3% bonus for
      CAP_SYS_ADMIN processes").
      
      Likewise children of the selected oom victim are not sacrificed since
      bbbe4802 ("mm, oom: remove 'prefer children over parent' heuristic")
      
      Drop both of them.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: http://lkml.kernel.org/r/20200709062603.18480-1-mhocko@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: proactive compaction · facdaa91
      Authored by Nitin Gupta
      For some applications, we need to allocate almost all memory as hugepages.
      However, on a running system, higher-order allocations can fail if the
      memory is fragmented.  The Linux kernel currently does on-demand compaction
      as we request more hugepages, but this style of compaction incurs very high
      latency.  Experiments with one-time full memory compaction (followed by
      hugepage allocations) show that the kernel is able to restore a highly
      fragmented memory state to a fairly compacted memory state within <1 sec
      for a 32G system.  Such data suggests that more proactive compaction can
      help us allocate a large fraction of memory as hugepages while keeping
      allocation latencies low.
      
      For a more proactive compaction, the approach taken here is to define a
      new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
      external fragmentation which kcompactd tries to maintain.
      
      The tunable takes a value in range [0, 100], with a default of 20.
      
      Note that a previous version of this patch [1] was found to introduce too
      many tunables (per-order extfrag{low, high}), but this one reduces them to
      just one sysctl.  Also, the new tunable is an opaque value instead of
      asking for specific bounds of "external fragmentation", which would have
      been difficult to estimate.  The internal interpretation of this opaque
      value allows for future fine-tuning.
      
      Currently, we use a simple translation from this tunable to [low, high]
      "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
      The score for a node is defined as weighted mean of per-zone external
      fragmentation.  A zone's present_pages determines its weight.
      
      To periodically check per-node score, we reuse per-node kcompactd threads,
      which are woken up every 500 milliseconds to check the same.  If a node's
      score exceeds its high threshold (as derived from user-provided
      proactiveness value), proactive compaction is started until its score
      reaches its low threshold value.  By default, proactiveness is set to 20,
      which implies threshold values of low=80 and high=90.
      
      This patch is largely based on ideas from Michal Hocko [2].  See also the
      LWN article [3].
      
      Performance data
      ================
      
      System: x86_64, 1T RAM, 80 CPU threads.
      Kernel: 5.6.0-rc3 + this patch
      
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
      
      Before starting the driver, the system was fragmented from a userspace
      program that allocates all memory and then for each 2M aligned section,
      frees 3/4 of base pages using munmap.  The workload is mainly anonymous
      userspace pages, which are easy to move around.  I intentionally avoided
      unmovable pages in this test to see how much latency we incur when
      hugepage allocations hit direct compaction.
      
      1. Kernel hugepage allocation latencies
      
      With the system in such a fragmented state, a kernel driver then allocates
      as many hugepages as possible and measures allocation latency:
      
      (all latency values are in microseconds)
      
      - With vanilla 5.6.0-rc3
      
        percentile   latency
        ----------   -------
                 5      7894
                10      9496
                25     12561
                30     15295
                40     18244
                50     21229
                60     27556
                75     30147
                80     31047
                90     32859
                95     33799
      
      Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      sysctl -w vm.compaction_proactiveness=20
      
        percentile   latency
        ----------   -------
                 5         2
                10         2
                25         3
                30         3
                40         3
                50         4
                60         4
                75         4
                80         4
                90         5
                95       429
      
      Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      2. Java heap allocation
      
      In this test, we first fragment memory using the same method as for (1).
      
      Then, we start a Java process with a heap size set to 700G and request the
      heap to be allocated with THP hugepages.  We also set THP to madvise to
      allow hugepage backing of this heap.
      
      /usr/bin/time
       java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
      
      The above command allocates 700G of Java heap using hugepages.
      
      - With vanilla 5.6.0-rc3
      
      17.39user 1666.48system 27:37.89elapsed
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      8.35user 194.58system 3:19.62elapsed
      
      Elapsed time remains around 3:15, as proactiveness is further increased.
      
      Note that proactive compaction happens throughout the runtime of these
      workloads.  The situation of one-time compaction, sufficient to supply
      hugepages for following allocation stream, can probably happen for more
      extreme proactiveness values, like 80 or 90.
      
      In the above Java workload, proactiveness is set to 20.  The test starts
      with a node's score of 80 or higher, depending on the delay between the
      fragmentation step and starting the benchmark, which gives more-or-less
      time for the initial round of compaction.  As the benchmark consumes
      hugepages, the node's score quickly rises above the high threshold (90) and
      proactive compaction starts again, which brings down the score to the low
      threshold level (80).  Repeat.
      
      bpftrace also confirms proactive compaction running 20+ times during the
      runtime of this Java benchmark.  kcompactd threads consume 100% of one of
      the CPUs while it tries to bring a node's score within thresholds.
      
      Backoff behavior
      ================
      
      The above workloads produce a memory state which is easy to compact.
      However, if memory is filled with unmovable pages, proactive compaction
      should essentially back off.  To test this aspect:
      
      - Created a kernel driver that allocates almost all memory as hugepages
        followed by freeing the first 3/4 of each hugepage.
      - Set proactiveness=40
      - Note that proactive_compact_node() is deferred the maximum number of
        times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
        (=> ~30 seconds between retries).
      
      [1] https://patchwork.kernel.org/patch/11098289/
      [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
      [3] https://lwn.net/Articles/817905/
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nitin Gupta <ngupta@nitingupta.dev>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
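The threshold arithmetic in the proactive compaction commit (facdaa91) can be worked through in a few lines of Python. This is a model of the description in the commit message, not the kernel code: the formulas `low = 100 - proactiveness` and `high = low + 10`, and the weighting by `present_pages`, come from the message, while the function names, the cap of `high` at 100, and the handling of an empty node are assumptions made for the sketch.

```python
def thresholds(proactiveness):
    """Map the vm.compaction_proactiveness tunable to (low, high) scores."""
    if not 0 <= proactiveness <= 100:
        raise ValueError("proactiveness must be in [0, 100]")
    low = 100 - proactiveness
    high = min(low + 10, 100)  # the cap at 100 is an assumption
    return low, high


def node_score(zones):
    """Weighted mean of per-zone fragmentation scores.

    zones is an iterable of (present_pages, zone_score) pairs; each
    zone is weighted by its present_pages, as in the commit message.
    """
    zones = list(zones)
    total_pages = sum(pages for pages, _ in zones)
    if total_pages == 0:
        return 0.0  # assumption: an empty node scores zero
    return sum(pages * score for pages, score in zones) / total_pages


def should_compact(compacting, score, low, high):
    """Hysteresis: start above high, keep going until low is reached."""
    if compacting:
        return score > low
    return score > high
```

With the default of 20, `thresholds(20)` gives `(80, 90)`, matching the values in the commit message: compaction starts once a node's score rises above 90 and continues until the score is back at 80.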
    • mm: memcg/percpu: per-memcg percpu memory statistics · 772616b0
      Authored by Roman Gushchin
      Percpu memory can represent a noticeable chunk of the total memory
      consumption, especially on big machines with many CPUs.  Let's track
      percpu memory usage for each memcg and display it in memory.stat.
      
      A percpu allocation is usually scattered over multiple pages (and nodes),
      and can be significantly smaller than a page.  So let's add a byte-sized
      counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
      created for slabs can be perfectly reused for percpu case.
      
      [guro@fb.com: v3]
        Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 12 Aug, 2020 13 commits
  5. 11 Aug, 2020 2 commits
    • cpufreq: intel_pstate: Implement passive mode with HWP enabled · f6ebbcf0
      Authored by Rafael J. Wysocki
      Allow intel_pstate to work in the passive mode with HWP enabled and
      make it set the HWP minimum performance limit (HWP floor) to the
      P-state value given by the target frequency supplied by the cpufreq
      governor, so as to prevent the HWP algorithm and the CPU scheduler
      from working against each other, at least when the schedutil governor
      is in use, and update the intel_pstate documentation accordingly.
      
      Among other things, this allows utilization clamps to be taken
      into account, at least to a certain extent, when intel_pstate is
      in use and makes it more likely that sufficient capacity for
      deadline tasks will be provided.
      
      After this change, the resulting behavior of an HWP system with
      intel_pstate in the passive mode should be close to the behavior
      of the analogous non-HWP system with intel_pstate in the passive
      mode, except that the HWP algorithm is generally allowed to make the
      CPU run at a frequency above the floor P-state set by intel_pstate in
      the entire available range of P-states, while without HWP a CPU can
      run in a P-state above the requested one if the latter falls into the
      range of turbo P-states (referred to as the turbo range) or if the
      P-states of all CPUs in one package are coordinated with each other
      at the hardware level.
      
      [Note that in principle the HWP floor may not be taken into account
       by the processor if it falls into the turbo range, in which case the
       processor has a license to choose any P-state, either below or above
       the HWP floor, just like a non-HWP processor in the case when the
       target P-state falls into the turbo range.]
      
      With this change applied, intel_pstate in the passive mode assumes
      complete control over the HWP request MSR and concurrent changes of
      that MSR (e.g. via the direct MSR access interface) are overridden by
      it.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Reviewed-by: Francisco Jerez <currojerez@riseup.net>
    • zonefs: update documentation to reflect zone size vs capacity · 4c96870e
      Authored by Johannes Thumshirn
      Update the zonefs documentation to reflect the difference between a zone's
      size and its capacity.
      
      The maximum file size in zonefs is the zone's capacity.  For ZBC and ZAC
      based devices, which do not have a separate zone capacity, the zone
      capacity is equal to the zone size.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
  6. 10 Aug, 2020 3 commits
    • kbuild: introduce hostprogs-always-y and userprogs-always-y · faabed29
      Authored by Masahiro Yamada
      To build host programs, you need to add the program names to 'hostprogs'
      to use the necessary build rule, but it is not enough to build them
      because there is no dependency.
      
      There are two types of host programs: built as the prerequisite of
      another (e.g. gen_crc32table in lib/Makefile), or always built when
      Kbuild visits the Makefile (e.g. genksyms in scripts/genksyms/Makefile).
      
      The latter is typical in Makefiles under scripts/, which contain host
      programs used globally during the kernel build. To build them, you need
      to add them to both 'hostprogs' and 'always-y'.
      
      This commit adds hostprogs-always-y as a shorthand.
      
      The same applies to user programs. net/bpfilter/Makefile builds
      bpfilter_umh on demand, hence always-y is unneeded. In contrast,
      programs under samples/ are added to both 'userprogs' and 'always-y'
      so they are always built when Kbuild visits the Makefiles.
      
      userprogs-always-y works as a shorthand.
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
    • kbuild: Replace HTTP links with HTTPS ones · 16a122c7
      Authored by Alexander A. Klimov
      Rationale:
      Reduces attack surface on kernel devs opening the links for MITM
      as HTTPS traffic is much harder to manipulate.
      
      Deterministic algorithm:
      For each file:
        If not .svg:
          For each line:
            If doesn't contain `\bxmlns\b`:
              For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
                If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
                  If both the HTTP and HTTPS versions
                  return 200 OK and serve the same content:
                    Replace HTTP with HTTPS.
      Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
    • kbuild: introduce ccflags-remove-y and asflags-remove-y · 15d5761a
      Authored by Masahiro Yamada
      CFLAGS_REMOVE_<file>.o filters out flags when compiling a particular
      object, but there is no convenient way to do that for every object in
      a directory.
      
      Add ccflags-remove-y and asflags-remove-y to make this easy.
      
      Use ccflags-remove-y to clean up some Makefiles.
      
      The add/remove order works as follows:
      
       [1] KBUILD_CFLAGS specifies compiler flags used globally
      
       [2] ccflags-y adds compiler flags for all objects in the
           current Makefile
      
       [3] ccflags-remove-y removes compiler flags for all objects in the
           current Makefile (New feature)
      
       [4] CFLAGS_<file> adds compiler flags per file.
      
       [5] CFLAGS_REMOVE_<file> removes compiler flags per file.
      
      Having [3] before [4] allows us to remove flags from most (but not all)
      objects in the current Makefile.
      
      For example, kernel/trace/Makefile removes $(CC_FLAGS_FTRACE)
      from all objects in the directory, then adds it back to
      trace_selftest_dynamic.o and trace_kprobe_selftest.o.
      
      The same applies to lib/livepatch/Makefile.
      
      Please note that ccflags-remove-y has no effect on sub-directories.
      In contrast, the previous notation also removed compiler flags from
      all the sub-directories.
      
      The following are not affected because they have no sub-directories:
      
        arch/arm/boot/compressed/
        arch/powerpc/xmon/
        arch/sh/
        kernel/trace/
      
      However, lib/ has several sub-directories.
      
      To keep the behavior, I added ccflags-remove-y to all Makefiles
      in subdirectories of lib/, except the following:
      
        lib/vdso/Makefile        - Kbuild does not descend into this Makefile
        lib/raid/test/Makefile   - This is not used for the kernel build
      
      I think commit 2464a609 ("ftrace: do not trace library functions")
      excluded too much. In the next commit, I will remove ccflags-remove-y
      from the sub-directories of lib/.
      Suggested-by: Sami Tolvanen <samitolvanen@google.com>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Acked-by: Brendan Higgins <brendanhiggins@google.com> (KUnit)
      Tested-by: Anders Roxell <anders.roxell@linaro.org>
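The add/remove order [1] to [5] listed in the ccflags-remove-y commit (15d5761a) can be modelled in Python to show why putting step [3] before step [4] lets a Makefile strip a flag from most objects and add it back for a few. This is a sketch of the ordering only, not Kbuild itself: `effective_cflags` is a hypothetical helper, removal is by exact string match, and the use of `-pg` as the value of `$(CC_FLAGS_FTRACE)` is an illustrative assumption.

```python
def effective_cflags(kbuild_cflags, ccflags_y=(), ccflags_remove_y=(),
                     cflags_file=(), cflags_remove_file=()):
    """Apply the five steps from the commit message in order."""
    # [1] global flags, then [2] flags added for the whole Makefile
    flags = list(kbuild_cflags) + list(ccflags_y)
    # [3] flags removed for every object in the Makefile
    removed_for_dir = set(ccflags_remove_y)
    flags = [flag for flag in flags if flag not in removed_for_dir]
    # [4] flags added for this one file
    flags += list(cflags_file)
    # [5] flags removed for this one file
    removed_for_file = set(cflags_remove_file)
    return [flag for flag in flags if flag not in removed_for_file]


# Illustrative values only; "-pg" stands in for $(CC_FLAGS_FTRACE).
GLOBAL_FLAGS = ["-O2", "-pg"]

# An ordinary object in the directory loses the flag at step [3].
ordinary = effective_cflags(GLOBAL_FLAGS, ccflags_remove_y=["-pg"])

# An object with a per-file CFLAGS_<file> entry gets it back at step [4].
selftest = effective_cflags(GLOBAL_FLAGS, ccflags_remove_y=["-pg"],
                            cflags_file=["-pg"])
```

Here `ordinary` ends up as `["-O2"]` and `selftest` as `["-O2", "-pg"]`. If step [4] ran before step [3], the per-file addition would be stripped again, which is the reason the commit message gives for the chosen order.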
  7. 08 Aug, 2020 7 commits