1. 23 Oct 2019, 1 commit
  2. 25 Sep 2019, 2 commits
    • memcg, kmem: deprecate kmem.limit_in_bytes · 0158115f
      Authored by Michal Hocko
      The cgroup v1 memcg controller has exposed a dedicated kmem limit to
      users, which turned out to be a really bad idea because there are paths
      which cannot shrink kernel memory usage enough to get below the limit
      (e.g. because the accounted memory is not reclaimable). There are cases
      where failure is not even allowed (e.g. __GFP_NOFAIL). This means that
      the kmem limit sits on top of the hard limit without any way to shrink
      below it and is thus completely useless. The OOM killer cannot be invoked
      to handle the situation because that would lead to premature oom killing.
      
      As a result, many places might see ENOMEM returned from kmalloc, leading
      to unexpected errors. E.g. the global OOM killer being invoked even
      though there is a lot of free memory, because ENOMEM is translated into
      VM_FAULT_OOM in the #PF path and pagefault_out_of_memory therefore
      results in the OOM killer.
      
      Please note that kernel memory is still accounted against the overall
      limit along with user memory, so removing the kmem-specific limit should
      still allow containing kernel memory consumption. Unlike the kmem limit,
      though, the overall limit invokes memory reclaim and targeted memcg oom
      killing if necessary.
      
      Start the deprecation process by warning in the kernel log. Let's see
      whether there are relevant use cases and simply return EINVAL in the
      second stage if nobody complains within a few releases. (An illustrative
      snippet follows this entry.)
      
      [akpm@linux-foundation.org: tweak documentation text]
      Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0158115f
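
      The snippet below is a minimal C sketch of how this soon-to-be-deprecated
      v1 knob is driven from userspace. The cgroup mount point and the group
      name ("demo") are assumptions for illustration; only the file name
      memory.kmem.limit_in_bytes comes from the commit above, and writing to it
      now also produces the deprecation warning in the kernel log.

      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
              /* Hypothetical v1 memory cgroup named "demo". */
              const char *path =
                      "/sys/fs/cgroup/memory/demo/memory.kmem.limit_in_bytes";
              FILE *f = fopen(path, "w");

              if (!f) {
                      perror("fopen");
                      return EXIT_FAILURE;
              }

              /* Ask for a 64 MiB kmem limit; this now warns in dmesg. */
              if (fprintf(f, "%llu\n", 64ULL << 20) < 0)
                      perror("fprintf");

              fclose(f);
              return 0;
      }
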
    • mm, page_owner, debug_pagealloc: save and dump freeing stack trace · 8974558f
      Authored by Vlastimil Babka
      The debug_pagealloc functionality is useful for catching buggy page
      allocator users that cause e.g. use-after-free or double free. When a
      page inconsistency is detected, debugging is often simpler when the call
      stacks of the processes that last allocated and freed the page are known.
      When page_owner is also enabled, we record the allocation stack trace,
      but not the freeing one.
      
      This patch therefore adds recording of the freeing process's stack trace
      to the page owner info, if both page_owner and debug_pagealloc are
      configured and enabled. With only page_owner enabled, this info is not
      useful for the memory leak debugging use case. dump_page() is adjusted to
      print the info (a small userspace sketch for browsing page_owner records
      follows this entry). An example result of calling __free_pages() twice
      may look like this (note the page last free stack trace):
      
      BUG: Bad page state in process bash  pfn:13d8f8
      page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1affff800000000()
      raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
      page dumped because: nonzero _refcount
      page_owner tracks the page as freed
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL)
       prep_new_page+0x143/0x150
       get_page_from_freelist+0x289/0x380
       __alloc_pages_nodemask+0x13c/0x2d0
       khugepaged+0x6e/0xc10
       kthread+0xf9/0x130
       ret_from_fork+0x3a/0x50
      page last free stack trace:
       free_pcp_prepare+0x134/0x1e0
       free_unref_page+0x18/0x90
       khugepaged+0x7b/0xc10
       kthread+0xf9/0x130
       ret_from_fork+0x3a/0x50
      Modules linked in:
      CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
      Call Trace:
       dump_stack+0x85/0xc0
       bad_page.cold+0xba/0xbf
       rmqueue_pcplist.isra.0+0x6c5/0x6d0
       rmqueue+0x2d/0x810
       get_page_from_freelist+0x191/0x380
       __alloc_pages_nodemask+0x13c/0x2d0
       __get_free_pages+0xd/0x30
       __pud_alloc+0x2c/0x110
       copy_page_range+0x4f9/0x630
       dup_mmap+0x362/0x480
       dup_mm+0x68/0x110
       copy_process+0x19e1/0x1b40
       _do_fork+0x73/0x310
       __x64_sys_clone+0x75/0x80
       do_syscall_64+0x6e/0x1e0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f10af854a10
      ...
      
      Link: http://lkml.kernel.org/r/20190820131828.22684-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8974558f
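
      As a hedged illustration of consuming the recorded data, the sketch below
      simply streams the page_owner records that the kernel exposes through
      debugfs. It assumes debugfs is mounted at /sys/kernel/debug and the
      kernel was booted with page_owner=on; the freeing stack added by this
      patch is printed by dump_page(), e.g. in bad-page reports like the one
      above.

      #include <stdio.h>

      int main(void)
      {
              FILE *f = fopen("/sys/kernel/debug/page_owner", "r");
              char buf[4096];
              size_t n;

              if (!f) {
                      perror("fopen");
                      return 1;
              }

              /* Copy the recorded allocation stacks to stdout for post-processing. */
              while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                      fwrite(buf, 1, n, stdout);

              fclose(f);
              return 0;
      }
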
  3. 14 Sep 2019, 2 commits
  4. 12 Sep 2019, 1 commit
    • dm: add clone target · 7431b783
      Authored by Nikos Tsironis
      Add the dm-clone target, which allows cloning of arbitrary block
      devices.
      
      dm-clone produces a one-to-one copy of an existing, read-only source
      device into a writable destination device: It presents a virtual block
      device which makes all data appear immediately, and redirects reads and
      writes accordingly.
      
      The main use case of dm-clone is to clone a potentially remote,
      high-latency, read-only, archival-type block device into a writable,
      fast, primary-type device for fast, low-latency I/O. The cloned device
      is visible/mountable immediately and the copy of the source device to
      the destination device happens in the background, in parallel with user
      I/O.
      
      When the cloning completes, the dm-clone table can be removed altogether
      and be replaced, e.g., by a linear table, mapping directly to the
      destination device.
      
      For further information and examples of how to use dm-clone, please read
      Documentation/admin-guide/device-mapper/dm-clone.rst. (An illustrative
      libdevmapper sketch follows this entry.)
      Suggested-by: Vangelis Koukis <vkoukis@arrikto.com>
      Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      7431b783
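
      Below is a hedged libdevmapper sketch of creating a dm-clone device,
      roughly equivalent to a dmsetup create call. The table argument order
      (metadata dev, destination dev, source dev, region size) follows
      dm-clone.rst; the device paths, mapping size and region size are
      illustrative placeholders, so treat this as a sketch rather than a
      recipe. Build with -ldevmapper.

      #include <libdevmapper.h>
      #include <stdio.h>

      int main(void)
      {
              struct dm_task *dmt = dm_task_create(DM_DEVICE_CREATE);
              /* Placeholder metadata/destination/source devices, 8-sector regions. */
              const char *params = "/dev/vdb /dev/vdc /dev/vdd 8";
              int ret = 1;

              if (!dmt)
                      return 1;

              /* 41943040 sectors of 512 bytes = a 20 GiB virtual device. */
              if (dm_task_set_name(dmt, "clone-demo") &&
                  dm_task_add_target(dmt, 0, 41943040ULL, "clone", params) &&
                  dm_task_run(dmt))
                      ret = 0;
              else
                      fprintf(stderr, "dm-clone setup failed\n");

              dm_task_destroy(dmt);
              return ret;
      }
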
  5. 11 Sep 2019, 1 commit
  6. 08 Sep 2019, 1 commit
  7. 06 Sep 2019, 1 commit
  8. 05 Sep 2019, 1 commit
    • powerpc/64s/radix: introduce options to disable use of the tlbie instruction · 2275d7b5
      Authored by Nicholas Piggin
      Introduce two options to control the use of the tlbie instruction. The
      first is a boot time option which completely prevents the kernel from
      using the instruction; this is currently incompatible with the HASH MMU,
      KVM, and coherent accelerators.

      The second is a debugfs option which can be switched at runtime and
      avoids using tlbie for invalidating CPU TLBs for normal process and
      kernel address mappings. Coherent accelerators are still managed with
      tlbie, as are KVM partition scope translations.
      
      Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
      basic implementation which does not attempt to make any optimisation
      beyond the tlbie implementation.
      
      This is useful for performance testing among other things. For example,
      in certain situations on large systems, using IPIs may be faster than
      tlbie as they can be directed rather than broadcast. Later we may also
      take advantage of the IPIs to do more interesting things such as trim
      the mm cpumask more aggressively. (A small debugfs sketch follows this
      entry.)
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com
      2275d7b5
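
      A small sketch of flipping the runtime switch from userspace follows. The
      exact debugfs entry name is an assumption here (tlbie_enabled under the
      powerpc debugfs directory); consult the series and its documentation for
      the authoritative path.

      #include <stdio.h>

      static int write_knob(const char *path, const char *val)
      {
              FILE *f = fopen(path, "w");

              if (!f) {
                      perror(path);
                      return -1;
              }
              fputs(val, f);
              fclose(f);
              return 0;
      }

      int main(void)
      {
              /* "0" switches CPU TLB invalidation over to IPIs + tlbiel. */
              return write_knob("/sys/kernel/debug/powerpc/tlbie_enabled", "0");
      }
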
  9. 04 Sep 2019, 1 commit
  10. 03 Sep 2019, 3 commits
    • Documentation:kernel-per-CPU-kthreads.txt: Remove reference to elevator= · fa99165c
      Authored by Marcos Paulo de Souza
      This argument has not been considered since blk-mq became the default,
      so remove this documentation to avoid confusion.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
      
      .txt file is now .rst
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fa99165c
    • block: elevator.c: Remove now unused elevator= argument · 85c0a037
      Authored by Marcos Paulo de Souza
      Since the inclusion of blk-mq, the elevator= argument has not been
      considered anymore; its utility died along with the legacy IO path,
      which has now been removed too.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
      
      Fold with doc removal patch.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      85c0a037
    • sched/uclamp: Extend CPU's cgroup controller · 2480c093
      Authored by Patrick Bellasi
      The cgroup CPU bandwidth controller allows assigning a specified
      (maximum) bandwidth to the tasks of a group. However, this bandwidth is
      defined and enforced only on a temporal basis, without considering the
      actual frequency a CPU is running at. Thus, the amount of computation
      completed by a task within an allocated bandwidth can be very different
      depending on the actual frequency the CPU runs that task at.
      The amount of computation can also be affected by the specific CPU a
      task is running on, especially when running on asymmetric capacity
      systems like Arm's big.LITTLE.
      
      With the availability of schedutil, the scheduler is now able
      to drive frequency selections based on actual task utilization.
      Moreover, the utilization clamping support provides a mechanism to
      bias the frequency selection operated by schedutil depending on
      constraints assigned to the tasks currently RUNNABLE on a CPU.
      
      Given the mechanisms described above, it is now possible to extend the
      cpu controller to specify the minimum (or maximum) utilization which
      should be considered for tasks RUNNABLE on a cpu.
      This makes it possible to better define the actual computational
      power assigned to task groups, thus improving the cgroup CPU bandwidth
      controller, which is currently based just on time constraints.

      Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
      which allow enforcing utilization boosting and capping for all the
      tasks in a group.
      
      Specifically:
      
      - uclamp.min: defines the minimum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run at least at a
      	      minimum frequency which corresponds to the uclamp.min
      	      utilization
      
      - uclamp.max: defines the maximum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run up to a
      	      maximum frequency which corresponds to the uclamp.max
      	      utilization
      
      These attributes:
      
      a) are available only for non-root nodes, both on default and legacy
         hierarchies, while system wide clamps are defined by a generic
         interface which does not depend on cgroups. This system wide
         interface enforces constraints on tasks in the root node.
      
      b) enforce effective constraints at each level of the hierarchy, which
         are a restriction of the group's requests considering its parent's
         effective constraints. Root group effective constraints are defined
         by the system wide interface.
         This mechanism allows each (non-root) level of the hierarchy to:
         - request whatever clamp values it would like to get
         - effectively get only up to the maximum amount allowed by its parent
      
      c) have higher priority than task-specific clamps, defined via
         sched_setattr(), thus allowing task requests to be controlled and
         restricted.
      
      Add two new attributes to the cpu controller to collect "requested"
      clamp values. Allow that at each non-root level of the hierarchy.
      Keep it simple by not caring now about "effective" values computation
      and propagation along the hierarchy.
      
      Update sysctl_sched_uclamp_handler() to use the newly introduced
      uclamp_mutex so that system default updates are serialized with
      cgroup-related updates. (An illustrative snippet follows this entry.)
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2480c093
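
      The sketch below drives the new attributes from userspace. The cgroup v2
      mount point and the group name ("demo") are assumptions for illustration;
      the file names cpu.uclamp.min and cpu.uclamp.max come from this patch,
      and the values written are utilization percentages.

      #include <stdio.h>

      static int set_attr(const char *file, const char *value)
      {
              char path[256];
              FILE *f;

              snprintf(path, sizeof(path), "/sys/fs/cgroup/demo/%s", file);
              f = fopen(path, "w");
              if (!f) {
                      perror(path);
                      return -1;
              }
              fputs(value, f);
              fclose(f);
              return 0;
      }

      int main(void)
      {
              /* Boost the group's tasks to at least 20% utilization... */
              set_attr("cpu.uclamp.min", "20");
              /* ...and cap them at 80%. */
              return set_attr("cpu.uclamp.max", "80");
      }
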
  11. 30 Aug 2019, 1 commit
  12. 29 Aug 2019, 2 commits
    • blkcg: add tools/cgroup/iocost_coef_gen.py · 8504dea7
      Authored by Tejun Heo
      Add a script which can be used to generate device-specific iocost
      linear model coefficients.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8504dea7
    • blkcg: implement blk-iocost · 7caa4715
      Authored by Tejun Heo
      This patchset implements an IO-cost-model-based, work-conserving
      proportional controller.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest latency target cgroup which is suffering is protected at
      the cost of all others.  In many use cases including stacking multiple
      workload containers in a single system, it's necessary to distribute
      IO capacity with better granularity.
      
      One challenge of controlling IO resources is the lack of a trivially
      observable cost metric.  The most common metrics - bandwidth and iops
      - can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions on how
      expensive a given stream of IOs would be, at least compared to other
      IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such a model.  The more accurate the cost
      model the better, but the controller adapts based on IO completion
      latency, and as long as the relative costs across different IO
      patterns are consistent and sensible, it'll adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
      
      Please see the top comment in blk-iocost.c and the documentation for
      more details. (A toy sketch of the linear model idea follows this
      entry.)
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7caa4715
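
      The following is a toy sketch of the linear cost model idea only: each IO
      is charged a per-IO cost plus a per-page cost, with separate sequential
      and random coefficients. The coefficient values are made up for the
      example; the kernel's real implementation and default parameters live in
      blk-iocost.c, and device-specific coefficients can be generated with
      tools/cgroup/iocost_coef_gen.py.

      #include <stdbool.h>
      #include <stdio.h>

      struct linear_coefs {
              unsigned long long seq_per_io, seq_per_page;
              unsigned long long rand_per_io, rand_per_page;
      };

      static unsigned long long io_cost(const struct linear_coefs *c,
                                        bool sequential, unsigned long pages)
      {
              if (sequential)
                      return c->seq_per_io + c->seq_per_page * pages;
              return c->rand_per_io + c->rand_per_page * pages;
      }

      int main(void)
      {
              /* Made-up coefficients: random IO carries a higher per-IO cost. */
              struct linear_coefs ssd = { 10, 2, 80, 2 };

              printf("64KiB seq  -> %llu cost units\n", io_cost(&ssd, true, 16));
              printf("4KiB  rand -> %llu cost units\n", io_cost(&ssd, false, 1));
              return 0;
      }
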
  13. 28 Aug 2019, 2 commits
  14. 23 Aug 2019, 2 commits
    • dm verity: add root hash pkcs#7 signature verification · 88cd3e6c
      Authored by Jaskaran Khurana
      The verification supports cases where the root hash is not secured by
      Trusted Boot, UEFI Secure Boot or similar technologies.

      One of the use cases for this is dm-verity volumes mounted after boot:
      the root hash provided during the creation of the dm-verity volume has
      to be trustworthy, so the in-kernel validation implemented here is used
      before we trust the root hash and allow the block device to be created.
      
      The signature provided for verification must verify the root hash and
      must be trusted by the builtin keyring for verification to succeed.

      The root hash signature is added as a key of type "user" and the key's
      description is passed to the kernel so it can look the key up and use it
      for verification.
      
      Adds CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG which can be turned on if root
      hash verification is needed.
      
      The dm_verity module parameter 'require_signatures' on the kernel
      command line indicates whether to force root hash signature verification
      (for all dm-verity volumes). (A keyring snippet follows this entry.)
      Signed-off-by: Jaskaran Khurana <jaskarankhurana@linux.microsoft.com>
      Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      88cd3e6c
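
      A hedged sketch of the userspace side follows: the PKCS#7 signature of
      the root hash is loaded as a key of type "user" so the kernel can look it
      up by description when the dm-verity table references it (via the
      root_hash_sig_key_desc option described in the dm-verity documentation).
      The file name and key description are illustrative. Build with -lkeyutils.

      #include <keyutils.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
              FILE *f = fopen("roothash.p7s", "rb");  /* detached PKCS#7 signature */
              char sig[8192];
              size_t len;
              key_serial_t key;

              if (!f) {
                      perror("fopen");
                      return EXIT_FAILURE;
              }
              len = fread(sig, 1, sizeof(sig), f);
              fclose(f);

              /* The description is what the dm-verity table will refer to. */
              key = add_key("user", "verity:demo", sig, len, KEY_SPEC_USER_KEYRING);
              if (key < 0) {
                      perror("add_key");
                      return EXIT_FAILURE;
              }
              printf("signature loaded as key %d\n", key);
              return 0;
      }
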
    • Documentation: Update Documentation for iommu.passthrough · c8fb436b
      Authored by Joerg Roedel
      This kernel parameter now also takes effect on X86.
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      c8fb436b
  15. 22 Aug 2019, 1 commit
  16. 20 Aug 2019, 2 commits
    • security: Add a static lockdown policy LSM · 000d388e
      Authored by Matthew Garrett
      While existing LSMs can be extended to handle lockdown policy,
      distributions generally want to be able to apply a straightforward
      static policy. This patch adds a simple LSM that can be configured to
      reject either integrity or all lockdown queries; the policy can be set
      at runtime (through securityfs), at boot time (via a kernel parameter)
      or at build time (via a kconfig option). Based on initial code by David
      Howells. (A small securityfs sketch follows this entry.)
      Signed-off-by: Matthew Garrett <mjg59@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
      000d388e
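
      The sketch below exercises the runtime (securityfs) knob mentioned above
      by raising the lockdown level to "integrity". It assumes securityfs is
      mounted at /sys/kernel/security; note that the transition is one-way and
      cannot be relaxed again at runtime.

      #include <stdio.h>

      int main(void)
      {
              const char *path = "/sys/kernel/security/lockdown";
              char current[128] = "";
              FILE *f;

              f = fopen(path, "r");
              if (f) {
                      /* Reading shows the active level in brackets. */
                      if (fgets(current, sizeof(current), f))
                              printf("before: %s", current);
                      fclose(f);
              }

              f = fopen(path, "w");
              if (!f) {
                      perror(path);
                      return 1;
              }
              /* Switch from "none" to the integrity lockdown policy. */
              fputs("integrity", f);
              fclose(f);
              return 0;
      }
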
    • x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h · c49a0a80
      Authored by Tom Lendacky
      There have been reports of RDRAND issues after resuming from suspend on
      some AMD family 15h and family 16h systems. This issue stems from a BIOS
      not performing the proper steps during resume to ensure RDRAND continues
      to function properly.
      
      RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be
      reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND
      support using CPUID, including the kernel, will believe that RDRAND is
      not supported.
      
      Update the CPU initialization to clear the RDRAND CPUID bit for any family
      15h and 16h processor that supports RDRAND. If it is known that the family
      15h or family 16h system does not have an RDRAND resume issue or that the
      system will not be placed in suspend, the "rdrand=force" kernel parameter
      can be used to stop the clearing of the RDRAND CPUID bit.
      
      Additionally, update the suspend and resume path to save and restore the
      MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in
      place after resuming from suspend.
      
      Note that clearing the RDRAND CPUID bit does not prevent a processor
      that normally supports the RDRAND instruction from executing it, so any
      code that determined support based on family and model won't #UD. (A
      small CPUID check sketch follows this entry.)
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
      Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "x86@kernel.org" <x86@kernel.org>
      Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com
      c49a0a80
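
      For illustration, the sketch below performs the same CPUID check that
      affected software relies on: after this change (and without rdrand=force)
      CPUID Fn00000001_ECX[30] reads as 0 on the affected parts, so callers
      fall back to other entropy sources. x86 only; it uses the compiler's
      <cpuid.h> helper.

      #include <cpuid.h>
      #include <stdio.h>

      int main(void)
      {
              unsigned int eax, ebx, ecx, edx;

              if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
                      fprintf(stderr, "CPUID leaf 1 not available\n");
                      return 1;
              }

              /* CPUID Fn00000001_ECX[30] advertises RDRAND support. */
              if (ecx & (1u << 30))
                      puts("RDRAND advertised");
              else
                      puts("RDRAND not advertised (possibly cleared by the kernel)");
              return 0;
      }
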
  17. 17 Aug 2019, 1 commit
  18. 14 Aug 2019, 1 commit
  19. 09 Aug 2019, 2 commits
  20. 04 Aug 2019, 1 commit
  21. 02 Aug 2019, 1 commit
  22. 01 Aug 2019, 7 commits
  23. 31 Jul 2019, 1 commit
  24. 24 Jul 2019, 1 commit
  25. 23 Jul 2019, 1 commit