1. 09 Sep 2021: 2 commits
  2. 13 Aug 2021: 1 commit
  3. 18 Jul 2021: 1 commit
  4. 09 Jul 2021: 2 commits
  5. 02 Jul 2021: 1 commit
    • A
      init: print out unknown kernel parameters · 86d1919a
      Committed by Andrew Halaney
      It is easy to foobar setting a kernel parameter on the command line
      without realizing it; by default there is not much output you can use
      to assess what the kernel did with that parameter.
      
      Make it a little more explicit which parameters on the command line
      _looked_ like valid parameters for the kernel, but did not match anything
      and ultimately got tossed to init.  This is very similar to the unknown
      parameter message received when loading a module.
      
      This assumes the parameters are processed in a normal fashion; some
      parameters (dyndbg= for example) don't register their parameter with the
      rest of the kernel's parameters, and therefore always show up in this list
      (and are also given to init, like the rest of this list).
      
      Another example: BOOT_IMAGE= is highlighted as an offender, which it
      technically is, but it is passed by LILO and GRUB, so most systems will
      see that complaint.
      
      An example output where "foobared" and "unrecognized" are intentionally
      invalid parameters:
      
        Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo
        Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo
      
      Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com
      Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Suggested-by: Borislav Petkov <bp@suse.de>
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86d1919a
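
      The behavior described above can be sketched as a minimal userspace
      model. This is not the kernel's parse_args() code; the `known` table
      and the is_known() helper are hypothetical stand-ins for the kernel's
      registered-parameter lookup:

      ```c
      #include <assert.h>
      #include <stdio.h>
      #include <string.h>

      /* Tokens whose name part matches no registered parameter are
       * collected into a single "Unknown command line parameters" line. */
      static const char *known[] = { "debug", "log_buf_len", NULL };

      static int is_known(const char *tok)
      {
          size_t name_len = strcspn(tok, "=");   /* compare only the name part */

          for (int i = 0; known[i]; i++)
              if (strlen(known[i]) == name_len &&
                  strncmp(known[i], tok, name_len) == 0)
                  return 1;
          return 0;
      }

      int main(void)
      {
          const char *cmdline[] = { "BOOT_IMAGE=/boot/vmlinuz", "debug",
                                    "log_buf_len=4M", "foobared",
                                    "unrecognized=foo" };

          printf("Unknown command line parameters:");
          for (size_t i = 0; i < sizeof(cmdline) / sizeof(cmdline[0]); i++)
              if (!is_known(cmdline[i]))
                  printf(" %s", cmdline[i]);
          printf("\n");
          return 0;
      }
      ```

      This reproduces the shape of the example output above: BOOT_IMAGE= is
      flagged along with the intentionally invalid parameters, because it was
      never registered by the kernel itself.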
  6. 23 Jun 2021: 1 commit
  7. 18 Jun 2021: 1 commit
  8. 11 Jun 2021: 1 commit
  9. 05 Jun 2021: 1 commit
    • M
      pid: take a reference when initializing `cad_pid` · 0711f0d7
      Committed by Mark Rutland
      During boot, kernel_init_freeable() initializes `cad_pid` to the init
      task's struct pid.  Later on, we may change `cad_pid` via a sysctl, and
      when this happens proc_do_cad_pid() will increment the refcount on the
      new pid via get_pid(), and will decrement the refcount on the old pid
      via put_pid().  As we never called get_pid() when we initialized
      `cad_pid`, we decrement a reference we never incremented, and can
      therefore free the init task's struct pid early.  As there can be dangling
      references to the struct pid, we can later encounter a use-after-free
      (e.g.  when delivering signals).
      
      This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to
      have been around since the conversion of `cad_pid` to struct pid in
      commit 9ec52099 ("[PATCH] replace cad_pid by a struct pid") from the
      pre-KASAN stone age of v2.6.19.
      
      Fix this by getting a reference to the init task's struct pid when we
      assign it to `cad_pid`.
      
      Full KASAN splat below.
      
         ==================================================================
         BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline]
         BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
         Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273
      
         CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1
         Hardware name: linux,dummy-virt (DT)
         Call trace:
          ns_of_pid include/linux/pid.h:153 [inline]
          task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
          do_notify_parent+0x308/0xe60 kernel/signal.c:1950
          exit_notify kernel/exit.c:682 [inline]
          do_exit+0x2334/0x2bd0 kernel/exit.c:845
          do_group_exit+0x108/0x2c8 kernel/exit.c:922
          get_signal+0x4e4/0x2a88 kernel/signal.c:2781
          do_signal arch/arm64/kernel/signal.c:882 [inline]
          do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936
          work_pending+0xc/0x2dc
      
         Allocated by task 0:
          slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516
          slab_alloc_node mm/slub.c:2907 [inline]
          slab_alloc mm/slub.c:2915 [inline]
          kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920
          alloc_pid+0xdc/0xc00 kernel/pid.c:180
          copy_process+0x2794/0x5e18 kernel/fork.c:2129
          kernel_clone+0x194/0x13c8 kernel/fork.c:2500
          kernel_thread+0xd4/0x110 kernel/fork.c:2552
          rest_init+0x44/0x4a0 init/main.c:687
          arch_call_rest_init+0x1c/0x28
          start_kernel+0x520/0x554 init/main.c:1064
          0x0
      
         Freed by task 270:
          slab_free_hook mm/slub.c:1562 [inline]
          slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600
          slab_free mm/slub.c:3161 [inline]
          kmem_cache_free+0x224/0x8e0 mm/slub.c:3177
          put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114
          put_pid+0x30/0x48 kernel/pid.c:109
          proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401
          proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591
          proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617
          call_write_iter include/linux/fs.h:1977 [inline]
          new_sync_write+0x3ac/0x510 fs/read_write.c:518
          vfs_write fs/read_write.c:605 [inline]
          vfs_write+0x9c4/0x1018 fs/read_write.c:585
          ksys_write+0x124/0x240 fs/read_write.c:658
          __do_sys_write fs/read_write.c:670 [inline]
          __se_sys_write fs/read_write.c:667 [inline]
          __arm64_sys_write+0x78/0xb0 fs/read_write.c:667
          __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
          invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
          el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129
          do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168
          el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416
          el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432
          el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701
      
         The buggy address belongs to the object at ffff23794dda0000
          which belongs to the cache pid of size 224
         The buggy address is located 4 bytes inside of
          224-byte region [ffff23794dda0000, ffff23794dda00e0)
         The buggy address belongs to the page:
         page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0
         head:(____ptrval____) order:1 compound_mapcount:0
         flags: 0x3fffc0000010200(slab|head)
         raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080
         raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
         page dumped because: kasan: bad access detected
      
         Memory state around the buggy address:
          ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
          ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
         >ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
          ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
          ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
         ==================================================================
      
      Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com
      Fixes: 9ec52099 ("[PATCH] replace cad_pid by a struct pid")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0711f0d7
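
      The refcount imbalance is easy to model in plain C. This is a hedged
      sketch with a toy struct pid and toy get_pid()/put_pid(); none of it
      is the kernel code, it only mirrors the get/put pattern the commit
      message describes:

      ```c
      #include <assert.h>
      #include <stdio.h>

      struct toy_pid { int count; };

      static struct toy_pid *get_pid(struct toy_pid *p) { p->count++; return p; }
      static void put_pid(struct toy_pid *p) { p->count--; }  /* freed at 0 */

      /* Simulate boot-time init of cad_pid followed by one sysctl write;
       * return the init task's remaining refcount (0 means freed early). */
      static int init_refs_after_sysctl(int take_ref_at_init)
      {
          struct toy_pid init_pid = { .count = 1 };   /* init task's own ref */
          struct toy_pid new_pid  = { .count = 1 };

          /* kernel_init_freeable(): the buggy code assigned without get_pid() */
          struct toy_pid *cad_pid =
              take_ref_at_init ? get_pid(&init_pid) : &init_pid;

          /* proc_do_cad_pid(): reference the new pid, drop the old one */
          struct toy_pid *old = cad_pid;
          cad_pid = get_pid(&new_pid);
          put_pid(old);
          (void)cad_pid;

          return init_pid.count;
      }

      int main(void)
      {
          printf("without get_pid() at init: refs left = %d (freed early)\n",
                 init_refs_after_sysctl(0));
          printf("with get_pid() at init:    refs left = %d (fixed)\n",
                 init_refs_after_sysctl(1));
          return 0;
      }
      ```

      With the fix, the sysctl's put_pid() only drops the reference taken at
      initialization, leaving the init task's own reference intact.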
  10. 01 Jun 2021: 2 commits
  11. 27 May 2021: 1 commit
  12. 12 May 2021: 2 commits
  13. 11 May 2021: 1 commit
    • F
      srcu: Initialize SRCU after timers · 8e9c01c7
      Committed by Frederic Weisbecker
      Once srcu_init() is called, the SRCU core will make use of delayed
      workqueues, which rely on timers.  However, init_timers() is called
      several steps after rcu_init().  This means that a call_srcu() after
      rcu_init() but before init_timers() would find itself within a dangerously
      uninitialized timer core.
      
      This commit therefore creates a separate call to srcu_init() after
      init_timers() completes, which ensures that we stay in early SRCU mode
      until timers are safe(r).
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      8e9c01c7
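
      The "early SRCU mode" idea, deferring callbacks until initialization
      has run, can be modeled in userspace. All the *_model names below are
      hypothetical; the real code defers onto timer-backed workqueues rather
      than a fixed array:

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Callbacks issued before initialization cannot use timer-backed
       * workqueues, so they are queued and replayed once init runs. */
      typedef void (*srcu_cb_t)(void);

      static srcu_cb_t early_queue[8];
      static int n_queued;
      static int srcu_ready;

      static void call_srcu_model(srcu_cb_t cb)
      {
          if (!srcu_ready)
              early_queue[n_queued++] = cb;   /* timers not up yet: defer */
          else
              cb();                           /* normal path: run it */
      }

      static void srcu_init_model(void)       /* called after init_timers() */
      {
          srcu_ready = 1;
          for (int i = 0; i < n_queued; i++)
              early_queue[i]();               /* flush the deferred work */
      }

      static int ran;
      static void my_cb(void) { ran++; }

      int main(void)
      {
          call_srcu_model(my_cb);             /* early: queued, not executed */
          printf("before srcu_init: ran=%d\n", ran);
          srcu_init_model();
          printf("after srcu_init:  ran=%d\n", ran);
          assert(ran == 1);
          return 0;
      }
      ```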
  14. 07 May 2021: 2 commits
    • R
      modules: add CONFIG_MODPROBE_PATH · 17652f42
      Committed by Rasmus Villemoes
      Allow the developer to specify the initial value of the modprobe_path[]
      string.  This can be used to set it to the empty string initially, thus
      effectively disabling request_module() during early boot until userspace
      writes a new value via the /proc/sys/kernel/modprobe interface.  [1]
      
      When building a custom kernel (often for an embedded target), it's normal
      to build everything into the kernel that is needed for booting, and indeed
      the initramfs often contains no modules at all, so every such
      request_module() done before userspace init has mounted the real rootfs is
      a waste of time.
      
      This is particularly useful when combined with the previous patch, which
      made the initramfs unpacking asynchronous - for that to work, it had to
      make any usermodehelper call wait for the unpacking to finish before
      attempting to invoke the userspace helper.  By eliminating all such
      (known-to-be-futile) calls of usermodehelper, the initramfs unpacking and
      the {device,late}_initcalls can proceed in parallel for much longer.
      
      For a relatively slow ppc board I'm working on, the two patches combined
      lead to 0.2s faster boot - but more importantly, the fact that the
      initramfs unpacking proceeds completely in the background while devices
      get probed means I get to handle the gpio watchdog in time without getting
      reset.
      
      [1] __request_module() already has an early -ENOENT return when
      modprobe_path is the empty string.
      
      Link: https://lkml.kernel.org/r/20210313212528.2956377-3-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17652f42
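
      As a hedged illustration of the intended usage (the sysctl path is the
      real /proc/sys/kernel/modprobe interface mentioned above; exactly when
      userspace re-enables it is up to the system's init):

      ```
      # .config fragment: boot with request_module() effectively disabled,
      # since __request_module() returns -ENOENT early for an empty path.
      CONFIG_MODPROBE_PATH=""

      # Later, once the real rootfs is mounted, userspace restores it:
      #   echo /sbin/modprobe > /proc/sys/kernel/modprobe
      ```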
    • R
      init/initramfs.c: do unpacking asynchronously · e7cb072e
      Committed by Rasmus Villemoes
      Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3.
      
      These two patches are independent, but better together.
      
      The second is a rather trivial patch that simply allows the developer to
      change "/sbin/modprobe" to something else - e.g.  the empty string, so
      that all request_module() during early boot return -ENOENT early, without
      even spawning a usermode helper, needlessly synchronizing with the
      initramfs unpacking.
      
      The first patch delegates decompressing the initramfs to a worker thread,
      allowing do_initcalls() in main.c to proceed to the device_ and late_
      initcalls without waiting for that decompression (and populating of
      rootfs) to finish.  Obviously, some of those later calls may rely on the
      initramfs being available, so I've added synchronization points in the
      firmware loader and usermodehelper paths - there might be other places
      that would need this, but so far no one has been able to think of any
      places I have missed.
      
      There is not much to gain if most of the functionality needed during boot
      is only available as modules.  But systems with a custom-made .config and
      initramfs can boot faster, partly due to utilizing more than one cpu
      earlier, partly by avoiding known-futile modprobe calls (which would still
      trigger synchronization with the initramfs unpacking, thus eliminating
      most of the first benefit).
      
      This patch (of 2):
      
      Most of the boot process doesn't actually need anything from the
      initramfs, until of course PID1 is to be executed.  So instead of doing
      the decompressing and populating of the initramfs synchronously in
      populate_rootfs() itself, push that off to a worker thread.
      
      This is primarily motivated by an embedded ppc target, where unpacking
      even the rather modest sized initramfs takes 0.6 seconds, which is long
      enough that the external watchdog becomes unhappy that it doesn't get
      attention soon enough.  By doing the initramfs decompression in a worker
      thread, we get to do the device_initcalls and hence start petting the
      watchdog much sooner.
      
      Normal desktops might benefit as well.  On my mostly stock Ubuntu kernel,
      my initramfs is a 26M xz-compressed blob, decompressing to around 126M.
      That takes almost two seconds:
      
      [    0.201454] Trying to unpack rootfs image as initramfs...
      [    1.976633] Freeing initrd memory: 29416K
      
      Before this patch, these lines occur consecutively in dmesg.  With this
      patch, the timestamps on these two lines are roughly the same as above,
      but with 172 lines in between - so more than one cpu has been kept busy
      doing work that would otherwise only happen after populate_rootfs()
      finished.
      
      Should one of the initcalls done after rootfs_initcall time (i.e., device_
      and late_ initcalls) need something from the initramfs (say, a kernel
      module or a firmware blob), it will simply wait for the initramfs
      unpacking to be done before proceeding, which should in theory make this
      completely safe.
      
      But if some driver pokes around in the filesystem directly and not via one
      of the official kernel interfaces (i.e.  request_firmware*(),
      call_usermodehelper*) that theory may not hold - also, I certainly might
      have missed a spot when sprinkling wait_for_initramfs().  So there is an
      escape hatch in the form of an initramfs_async= command line parameter.
      
      Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk
      Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7cb072e
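
      The worker-plus-synchronization-point pattern can be modeled with
      plain pthreads. This is a hedged userspace sketch, not the kernel
      implementation; wait_for_initramfs_model() stands in for the
      wait_for_initramfs() calls sprinkled into the firmware loader and
      usermodehelper paths:

      ```c
      #include <assert.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t done_cv = PTHREAD_COND_INITIALIZER;
      static int unpack_done;

      static void *unpack_worker(void *arg)
      {
          (void)arg;
          usleep(10000);              /* stand-in for the decompression work */
          pthread_mutex_lock(&lock);
          unpack_done = 1;
          pthread_cond_broadcast(&done_cv);
          pthread_mutex_unlock(&lock);
          return NULL;
      }

      static void wait_for_initramfs_model(void)
      {
          pthread_mutex_lock(&lock);
          while (!unpack_done)
              pthread_cond_wait(&done_cv, &lock);
          pthread_mutex_unlock(&lock);
      }

      int main(void)
      {
          pthread_t worker;
          pthread_create(&worker, NULL, unpack_worker, NULL);

          /* device_ and late_ initcalls proceed here, concurrently ... */
          printf("initcalls running while the initramfs unpacks\n");

          /* ... until something needs rootfs (firmware, usermodehelper): */
          wait_for_initramfs_model();
          printf("unpacking finished; safe to use rootfs\n");

          pthread_join(worker, NULL);
          return 0;
      }
      ```

      The design point matches the commit: consumers that go through the
      official interfaces block at the synchronization point, while everything
      else overlaps with the unpacking.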
  15. 06 May 2021: 1 commit
    • A
      userfaultfd: add minor fault registration mode · 7677f7fd
      Committed by Axel Rasmussen
      Patch series "userfaultfd: add minor fault handling", v9.
      
      Overview
      ========
      
      This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
      When enabled (via the UFFDIO_API ioctl), this feature means that any
      hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
      get events for "minor" faults.  By "minor" fault, I mean the following
      situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
      memory).  One of the mappings is registered with userfaultfd (in minor
      mode), and the other is not.  Via the non-UFFD mapping, the underlying
      pages have already been allocated & filled with some contents.  The UFFD
      mapping has not yet been faulted in; when it is touched for the first
      time, this results in what I'm calling a "minor" fault.  As a concrete
      example, when working with hugetlbfs, we have huge_pte_none(), but
      find_lock_page() finds an existing page.
      
      We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
      is, userspace resolves the fault by either a) doing nothing if the
      contents are already correct, or b) updating the underlying contents using
      the second, non-UFFD mapping (via memcpy/memset or similar, or something
      fancier like RDMA, or etc...).  In either case, userspace issues
      UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
      correct, carry on setting up the mapping".
      
      Use Case
      ========
      
      Consider the use case of VM live migration (e.g. under QEMU/KVM):
      
      1. While a VM is still running, we copy the contents of its memory to a
         target machine. The pages are populated on the target by writing to the
         non-UFFD mapping, using the setup described above. The VM is still running
         (and therefore its memory is likely changing), so this may be repeated
         several times, until we decide the target is "up to date enough".
      
      2. We pause the VM on the source, and start executing on the target machine.
         During this gap, the VM's user(s) will *see* a pause, so it is desirable to
         minimize this window.
      
      3. Between the last time any page was copied from the source to the target, and
         when the VM was paused, the contents of that page may have changed - and
         therefore the copy we have on the target machine is out of date. Although we
         can keep track of which pages are out of date, for VMs with large amounts of
         memory, it is "slow" to transfer this information to the target machine. We
         want to resume execution before such a transfer would complete.
      
      4. So, the guest begins executing on the target machine. The first time it
         touches its memory (via the UFFD-registered mapping), userspace wants to
         intercept this fault. Userspace checks whether or not the page is up to date,
         and if not, copies the updated page from the source machine, via the non-UFFD
         mapping. Finally, whether a copy was performed or not, userspace issues a
         UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
         are correct, carry on setting up the mapping".
      
      We don't have to do all of the final updates on-demand. The userfaultfd manager
      can, in the background, also copy over updated pages once it receives the map of
      which pages are up-to-date or not.
      
      Interaction with Existing APIs
      ==============================
      
      Because this is a feature, a registered VMA could potentially receive both
      missing and minor faults.  I spent some time thinking through how the
      existing API interacts with the new feature:
      
      UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
      allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:
      
      - For non-shared memory or shmem, -EINVAL is returned.
      - For hugetlb, -EFAULT is returned.
      
      UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
      Without modifications, the existing codepath assumes a new page needs to
      be allocated.  This is okay, since userspace must have a second
      non-UFFD-registered mapping anyway, thus there isn't much reason to want
      to use these in any case (just memcpy or memset or similar).
      
      - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
      - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
        in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
      - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
        -ENOENT in that case (regardless of the kind of fault).
      
      Future Work
      ===========
      
      This series only supports hugetlbfs.  I have a second series in flight to
      support shmem as well, extending the functionality.  This series is more
      mature than the shmem support at this point, and the functionality works
      fully on hugetlbfs, so this series can be merged first and then shmem
      support will follow.
      
      This patch (of 6):
      
      This feature allows userspace to intercept "minor" faults.  By "minor"
      faults, I mean the following situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
      mappings is registered with userfaultfd (in minor mode), and the other is
      not.  Via the non-UFFD mapping, the underlying pages have already been
      allocated & filled with some contents.  The UFFD mapping has not yet been
      faulted in; when it is touched for the first time, this results in what
      I'm calling a "minor" fault.  As a concrete example, when working with
      hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
      page.
      
      This commit adds the new registration mode, and sets the relevant flag on
      the VMAs being registered.  In the hugetlb fault path, if we find that we
      have huge_pte_none(), but find_lock_page() does indeed find an existing
      page, then we have a "minor" fault, and if the VMA has the userfaultfd
      registration flag, we call into userfaultfd to handle it.
      
      This is implemented as a new registration mode, instead of an API feature.
      This is because the alternative implementation has significant drawbacks
      [1].
      
      However, doing it this way requires allocating a VM_* flag for the new
      registration mode.  On 32-bit systems, there are no unused bits, so this
      feature is only supported on architectures with
      CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
      MINOR mode on 32-bit architectures, we return -EINVAL.
      
      [1] https://lore.kernel.org/patchwork/patch/1380226/
      
      [peterx@redhat.com: fix minor fault page leak]
        Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7677f7fd
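
      The two-mapping setup the series relies on can be reproduced in plain
      userspace, without userfaultfd itself. This hedged sketch only shows
      the "two VMAs over the same shared pages" situation; with the new
      feature, the second mapping would additionally be registered in
      UFFDIO_REGISTER_MODE_MINOR so its first touch raises a minor fault:

      ```c
      #define _GNU_SOURCE
      #include <assert.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* Returns 1 if both views observe the same contents, -1 on error. */
      static int views_match(void)
      {
          size_t len = (size_t)sysconf(_SC_PAGESIZE);
          int fd = memfd_create("shared", 0);
          if (fd < 0 || ftruncate(fd, (off_t)len) < 0)
              return -1;

          /* Two VMAs backed by the same shared pages. */
          char *populate = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
          char *uffd_view = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
          if (populate == MAP_FAILED || uffd_view == MAP_FAILED)
              return -1;

          /* Allocate & fill pages via the non-UFFD mapping ... */
          strcpy(populate, "written via the non-UFFD mapping");

          /* ... so the first touch of the other view finds an existing
           * page: exactly the "minor fault" situation described above. */
          return strcmp(populate, uffd_view) == 0;
      }

      int main(void)
      {
          printf("views match: %d\n", views_match());
          return 0;
      }
      ```

      In the live-migration use case, the populate mapping is where pages
      are copied in from the source machine, and UFFDIO_CONTINUE then tells
      the kernel to finish setting up the registered view.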
  16. 01 May 2021: 2 commits
  17. 28 Apr 2021: 1 commit
    • F
      bpf: Implement formatted output helpers with bstr_printf · 48cac3f4
      Committed by Florent Revest
      BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf
      and bpf_snprintf. Their signatures specify that all arguments are
      provided from the BPF world as u64s (in an array or as registers). All
      of these helpers are currently implemented by calling functions such as
      snprintf(), whose signatures take a variable number of arguments that are
      then placed in a va_list by the compiler to call vsnprintf().
      
      "d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced
      a bpf_printf_prepare function that fills an array of u64 sanitized
      arguments with an array of "modifiers" which indicate what the "real"
      size of each argument should be (given by the format specifier). The
      BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to
      its real size. However, the C promotion rules implicitly cast them all
      back to u64s. Therefore, the arguments given to snprintf are u64s and
      the va_list constructed by the compiler will use 64 bits for each
      argument. On 64 bit machines, this happens to work well because 32 bit
      arguments in va_lists need to occupy 64 bits anyway, but on 32 bit
      architectures this breaks the layout of the va_list expected by the
      called function and mangles values.
      
      In "88a5c690 bpf: fix bpf_trace_printk on 32 bit archs", this problem
      had been solved for bpf_trace_printk only with a "horrid workaround"
      that emitted multiple calls to trace_printk where each call had
      different argument types and generated different va_list layouts. One of
      the call would be dynamically chosen at runtime. This was ok with the 3
      arguments that bpf_trace_printk takes but bpf_seq_printf and
      bpf_snprintf accept up to 12 arguments. Because this approach scales
      code exponentially, it is not a viable option anymore.
      
      Because the promotion rules are part of the language and because the
      construction of a va_list is an arch-specific ABI, it's best to just
      avoid variadic arguments and va_lists altogether. Thankfully the
      kernel's snprintf() has an alternative in the form of bstr_printf() that
      accepts arguments in a "binary buffer representation". These binary
      buffers are currently created by vbin_printf and used in the tracing
      subsystem to split the cost of printing into two parts: a fast one that
      only dereferences and remembers values, and a slower one, called later,
      that does the pretty-printing.
      
      This patch refactors bpf_printf_prepare to construct binary buffers of
      arguments consumable by bstr_printf() instead of arrays of arguments and
      modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the
      bpf_printf_prepare usage but there are a few gotchas that change how
      bpf_printf_prepare needs to do things.
      
      Currently, bpf_printf_prepare uses a per cpu temporary buffer as a
      generic storage for strings and IP addresses. With this refactoring, the
      temporary buffer now holds all the arguments in a structured binary
      format.
      
      To comply with the format expected by bstr_printf, certain format
      specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6.
      Because vsnprintf subroutines for these specifiers are hard to expose,
      we pre-format these arguments with calls to snprintf().
      Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org
      48cac3f4
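
      The two-phase "binary buffer" idea can be sketched in userspace. This
      is a hedged toy, not the kernel's vbin_printf()/bstr_printf(); it only
      handles %d and %s, but it shows how storing each argument at its real
      size removes any dependence on the arch-specific va_list layout at
      formatting time:

      ```c
      #include <assert.h>
      #include <stdarg.h>
      #include <stdio.h>
      #include <string.h>

      /* Phase 1: record the arguments into a byte buffer at their real
       * size, as determined by the format specifiers. */
      static size_t bin_record(char *buf, const char *fmt, ...)
      {
          va_list ap;
          size_t off = 0;

          va_start(ap, fmt);
          for (const char *p = fmt; *p; p++) {
              if (*p != '%')
                  continue;
              if (!*++p)
                  break;
              if (*p == 'd') {                 /* store a real 32-bit int */
                  int v = va_arg(ap, int);
                  memcpy(buf + off, &v, sizeof(v));
                  off += sizeof(v);
              } else if (*p == 's') {          /* store the string inline */
                  const char *s = va_arg(ap, const char *);
                  size_t n = strlen(s) + 1;
                  memcpy(buf + off, s, n);
                  off += n;
              }
          }
          va_end(ap);
          return off;
      }

      /* Phase 2: pretty-print later, reading values back from the buffer,
       * with no va_list involved at all. */
      static void bin_format(char *out, size_t outsz,
                             const char *fmt, const char *buf)
      {
          size_t off = 0, w = 0;

          for (const char *p = fmt; *p && w + 1 < outsz; p++) {
              if (*p != '%') {
                  out[w++] = *p;
                  continue;
              }
              if (!*++p)
                  break;
              if (*p == 'd') {
                  int v;
                  memcpy(&v, buf + off, sizeof(v));
                  off += sizeof(v);
                  w += (size_t)snprintf(out + w, outsz - w, "%d", v);
              } else if (*p == 's') {
                  w += (size_t)snprintf(out + w, outsz - w, "%s", buf + off);
                  off += strlen(buf + off) + 1;
              }
          }
          out[w] = '\0';
      }

      int main(void)
      {
          char bin[64], out[64];

          bin_record(bin, "x=%d s=%s", 42, "hi");
          bin_format(out, sizeof(out), "x=%d s=%s", bin);
          puts(out);
          assert(strcmp(out, "x=42 s=hi") == 0);
          return 0;
      }
      ```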
  18. 25 Apr 2021: 5 commits
    • A
      kbuild: redo fake deps at include/config/*.h · 0e0345b7
      Committed by Alexey Dobriyan
      Make generation of the fake dependency files under include/config/ simpler.
      
      * delete .h suffix
      	those aren't header files, shorten filenames,
      
      * delete tolower()
      	Linux filesystems can deal with both upper and lowercase
      	filenames very well,
      
      * put everything in 1 directory
      	Presumably 'mkdir -p' split is from dark times when filesystems
      	handled huge directories badly, and disks were round, adding to
      	seek times.
      
      	x86_64 allmodconfig lists 12364 files in include/config.
      
      	../obj/include/config/
      	├── 104_QUAD_8
      	├── 60XX_WDT
      	├── 64BIT
      		...
      	├── ZSWAP_DEFAULT_ON
      	├── ZSWAP_ZPOOL_DEFAULT
      	└── ZSWAP_ZPOOL_DEFAULT_ZBUD
      
      	0 directories, 12364 files
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      0e0345b7
    • Y
      kbuild: add an elfnote for whether vmlinux is built with lto · 1fdd7433
      Committed by Yonghong Song
      Currently, a clang LTO built vmlinux won't work with pahole.
      LTO introduced cross-cu dwarf tag references and broke the
      current pahole model, which handles one cu at a time.
      The solution is to merge all cu's into one pahole cu, as in [1].
      We would like to do this merging only if cross-cu dwarf
      references happen. The LTO build mode is a pretty good
      indication for that.
      
      In an earlier version of this patch ([2]), the clang flag
      -grecord-gcc-switches was proposed as an addition to the compilation
      flags so that pahole could detect "-flto" and then merge cu's.
      This would increase the binary size by 1% even without LTO, though.
      
      Arnaldo suggested using a note to indicate whether the vmlinux
      is built with LTO. Such a cheap way to tell whether the vmlinux
      is built with LTO helps pahole, but is also useful
      for tracing, as LTO may inline/delete/demote global functions,
      promote static functions, etc.
      
      So this patch added an elfnote with a new type LINUX_ELFNOTE_LTO_INFO.
      The owner of the note is "Linux".
      
      With gcc 8.4.1 and clang trunk, without LTO, I got
        $ readelf -n vmlinux
        Displaying notes found in: .notes
          Owner                Data size        Description
        ...
          Linux                0x00000004       func
           description data: 00 00 00 00
        ...
      With "readelf -x .notes vmlinux", I can verify the above "func"
      with type code 0x101.
      
      With clang thin-LTO, I got the same as above except the following:
           description data: 01 00 00 00
      which indicates the vmlinux is built with LTO.
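
       The note itself follows the standard ELF note layout: three
       4-byte words (namesz, descsz, type), then the 4-byte-padded
       owner name and descriptor. As a rough illustration only, the
       following Python sketch builds and parses such a note; the
       parser helper is hypothetical, and only the "Linux" owner,
       the 0x101 type code, and the 4-byte descriptor are taken from
       the output above:

```python
import struct

LINUX_ELFNOTE_LTO_INFO = 0x101  # type code seen in the readelf -x dump above

def parse_elf_notes(blob):
    """Hypothetical parser for a chunk of an ELF .notes section:
    each note is (namesz, descsz, type) followed by the 4-byte-padded
    name and descriptor."""
    notes, off = [], 0
    while off + 12 <= len(blob):
        namesz, descsz, ntype = struct.unpack_from("<III", blob, off)
        off += 12
        name = blob[off:off + namesz].rstrip(b"\0").decode()
        off += (namesz + 3) & ~3          # names are padded to 4 bytes
        desc = blob[off:off + descsz]
        off += (descsz + 3) & ~3          # so are descriptors
        notes.append((name, ntype, desc))
    return notes

# Build a note as the patch describes: owner "Linux", 4-byte desc = 1 (LTO).
note = (struct.pack("<III", 6, 4, LINUX_ELFNOTE_LTO_INFO)
        + b"Linux\0\0\0" + struct.pack("<I", 1))

for owner, ntype, desc in parse_elf_notes(note):
    built_with_lto = struct.unpack("<I", desc)[0]
    print(owner, hex(ntype), "LTO" if built_with_lto else "no LTO")
```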
      
        [1] https://lore.kernel.org/bpf/20210325065316.3121287-1-yhs@fb.com/
         [2] https://lore.kernel.org/bpf/20210331001623.2778934-1-yhs@fb.com/
       Suggested-by: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com>
       Signed-off-by: Yonghong Song <yhs@fb.com>
       Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
       Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v12.0.0-rc4 (x86-64)
       Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
       Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      1fdd7433
     •
       kbuild: add support for zstd compressed modules · c3d7ef37
       Authored by Piotr Gorski
       kmod 28 supports modules compressed in the zstd format, so
       let's add this option to the kernel.
       Signed-off-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
       Reviewed-by: Oleksandr Natalenko <oleksandr@natalenko.name>
       Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      c3d7ef37
     •
       kbuild: remove CONFIG_MODULE_COMPRESS · d4bbe942
       Authored by Masahiro Yamada
       CONFIG_MODULE_COMPRESS is only used to activate the choice of
       module compression algorithm. It will be simpler to make the
       choice always visible, and add CONFIG_MODULE_COMPRESS_NONE to
       the choice.
      
      This is more consistent with the "Kernel compression mode" and "Built-in
      initramfs compression mode" choices. CONFIG_KERNEL_UNCOMPRESSED and
      CONFIG_INITRAMFS_COMPRESSION_NONE are available to choose no compression.
       Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
       Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      d4bbe942
     •
       kbuild: check the minimum assembler version in Kconfig · ba64beb1
       Authored by Masahiro Yamada
       Documentation/process/changes.rst defines the minimum assembler
       version (binutils version), but we have never checked it at
       build time.
      
       Kbuild never invokes 'as' directly because all assembly files
       in the kernel tree are *.S, hence must be preprocessed. I do
       not expect raw assembly source files (*.s) to be added to the
       kernel tree.
      
       Therefore, we always use $(CC) as the assembler driver, and
       commit aa824e0c ("kbuild: remove AS variable") removed 'AS'.
       However, we are still interested in the version of the
       assembler working behind it.
      
      As usual, the --version option prints the version string.
      
        $ as --version | head -n 1
        GNU assembler (GNU Binutils for Ubuntu) 2.35.1
      
       But we do not have $(AS). Instead, we can add the -Wa prefix
       so that $(CC) passes --version down to the backing assembler.
      
        $ gcc -Wa,--version | head -n 1
        gcc: fatal error: no input files
        compilation terminated.
      
      OK, we need to input something to satisfy gcc.
      
        $ gcc -Wa,--version -c -x assembler /dev/null -o /dev/null | head -n 1
        GNU assembler (GNU Binutils for Ubuntu) 2.35.1
      
      The combination of Clang and GNU assembler works in the same way:
      
        $ clang -no-integrated-as -Wa,--version -c -x assembler /dev/null -o /dev/null | head -n 1
        GNU assembler (GNU Binutils for Ubuntu) 2.35.1
      
      Clang with the integrated assembler fails like this:
      
        $ clang -integrated-as -Wa,--version -c -x assembler /dev/null -o /dev/null | head -n 1
        clang: error: unsupported argument '--version' to option 'Wa,'
      
       For the last case, checking the error message is fragile. If
       the proposal for -Wa,--version support [1] is accepted, this
       may not even be an error in the future.
      
      One easy way is to check if -integrated-as is present in the passed
      arguments. We did not pass -integrated-as to CLANG_FLAGS before, but
      we can make it explicit.
      
      Nathan pointed out -integrated-as is the default for all of the
      architectures/targets that the kernel cares about, but it goes
      along with "explicit is better than implicit" policy. [2]
      
       With all this in mind, I implemented scripts/as-version.sh to
       check the assembler version at Kconfig time.
      
        $ scripts/as-version.sh gcc
        GNU 23501
        $ scripts/as-version.sh clang -no-integrated-as
        GNU 23501
        $ scripts/as-version.sh clang -integrated-as
        LLVM 0
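
       The encoding can be read off the sample output: binutils 2.35.1
       becomes 23501, i.e. major * 10000 + minor * 100 + patch. A
       Python sketch of that conversion follows; the function name
       and string handling are assumptions for illustration, not the
       actual scripts/as-version.sh logic:

```python
import re

def encode_as_version(version_line):
    """Turn an assembler --version line into the "<NAME> <NUMBER>"
    form shown above, assuming major*10000 + minor*100 + patch."""
    if version_line.startswith("GNU assembler"):
        name = "GNU"
    else:
        # e.g. clang's integrated assembler, which rejects -Wa,--version
        return "LLVM 0"
    m = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_line)
    major, minor, patch = (int(g or 0) for g in m.groups())
    return "%s %d" % (name, major * 10000 + minor * 100 + patch)

print(encode_as_version("GNU assembler (GNU Binutils for Ubuntu) 2.35.1"))
```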
      
      [1]: https://github.com/ClangBuiltLinux/linux/issues/1320
       [2]: https://lore.kernel.org/linux-kbuild/20210307044253.v3h47ucq6ng25iay@archlinux-ax161/
       Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
       Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      ba64beb1
   19. 14 Apr 2021, 3 commits
     •
       kconfig: change "modules" from sub-option to first-level attribute · 6dd85ff1
       Authored by Masahiro Yamada
      Now "modules" is the only member of the "option" property.
      
      Remove "option", and move "modules" to the top level property.
       Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      6dd85ff1
     •
       kconfig: do not use allnoconfig_y option · f8f0d064
       Authored by Masahiro Yamada
       allnoconfig_y is an ugly hack that makes allnoconfig set a symbol to 'y'.
      
      allnoconfig does not mean a minimal set of CONFIG options because a
      bunch of prompts are hidden by 'if EMBEDDED' or 'if EXPERT', but I do
      not like to hack Kconfig this way.
      
       Use the pre-existing KCONFIG_ALLCONFIG feature to provide a
       one-liner config fragment. CONFIG_EMBEDDED=y is still forced
       when allnoconfig is invoked as part of tinyconfig.
      
      No change in the .config file produced by 'make tinyconfig'.
      
      The output of 'make allnoconfig' will be changed; we will get
      CONFIG_EMBEDDED=n because allnoconfig literally sets all symbols to n.
       Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      f8f0d064
     •
       kconfig: change defconfig_list option to environment variable · b75b0a81
       Authored by Masahiro Yamada
      "defconfig_list" is a weird option that defines a static symbol that
      declares the list of base config files in case the .config does not
      exist yet.
      
      This is quite different from other normal symbols; we just abused the
      "string" type and the "default" properties to list out the input files.
      They must be fixed values since these are searched for and loaded in
      the parse stage.
      
      It is an ugly hack, and should not exist in the first place. Providing
      this feature as an environment variable is a saner approach.
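
       As an illustration of the environment-variable approach, here
       is a small Python sketch of picking the first existing file
       from a whitespace-separated list; the variable name
       KCONFIG_DEFCONFIG_LIST matches kconfig's, but the helper and
       fallback list are illustrative assumptions:

```python
import os

def find_base_config(fallback=("/boot/config-example", ".config.default")):
    """Return the first existing file named by KCONFIG_DEFCONFIG_LIST
    (whitespace-separated), or from an illustrative fallback list."""
    candidates = os.environ.get("KCONFIG_DEFCONFIG_LIST", "").split()
    for path in candidates or list(fallback):
        if os.path.exists(path):
            return path
    return None
```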
       Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      b75b0a81
   20. 09 Apr 2021, 2 commits
     •
       ima: enable signing of modules with build time generated key · 0165f4ca
       Authored by Nayna Jain
       The kernel build process currently signs kernel modules only
       when MODULE_SIG is enabled. Also sign the kernel modules at
       build time when IMA_APPRAISE_MODSIG is enabled.
       Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
       Acked-by: Stefan Berger <stefanb@linux.ibm.com>
       Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
      0165f4ca
     •
       add support for Clang CFI · cf68fffb
       Authored by Sami Tolvanen
      This change adds support for Clang’s forward-edge Control Flow
      Integrity (CFI) checking. With CONFIG_CFI_CLANG, the compiler
      injects a runtime check before each indirect function call to ensure
      the target is a valid function with the correct static type. This
      restricts possible call targets and makes it more difficult for
      an attacker to exploit bugs that allow the modification of stored
      function pointers. For more details, see:
      
        https://clang.llvm.org/docs/ControlFlowIntegrity.html
      
      Clang requires CONFIG_LTO_CLANG to be enabled with CFI to gain
      visibility to possible call targets. Kernel modules are supported
      with Clang’s cross-DSO CFI mode, which allows checking between
      independently compiled components.
      
      With CFI enabled, the compiler injects a __cfi_check() function into
      the kernel and each module for validating local call targets. For
      cross-module calls that cannot be validated locally, the compiler
      calls the global __cfi_slowpath_diag() function, which determines
      the target module and calls the correct __cfi_check() function. This
      patch includes a slowpath implementation that uses __module_address()
      to resolve call targets, and with CONFIG_CFI_CLANG_SHADOW enabled, a
      shadow map that speeds up module look-ups by ~3x.
      
      Clang implements indirect call checking using jump tables and
      offers two methods of generating them. With canonical jump tables,
      the compiler renames each address-taken function to <function>.cfi
      and points the original symbol to a jump table entry, which passes
      __cfi_check() validation. This isn’t compatible with stand-alone
      assembly code, which the compiler doesn’t instrument, and would
       cause indirect calls to assembly code to fail. Therefore, we
      default to using non-canonical jump tables instead, where the compiler
      generates a local jump table entry <function>.cfi_jt for each
      address-taken function, and replaces all references to the function
      with the address of the jump table entry.
      
      Note that because non-canonical jump table addresses are local
      to each component, they break cross-module function address
      equality. Specifically, the address of a global function will be
      different in each module, as it's replaced with the address of a local
      jump table entry. If this address is passed to a different module,
      it won’t match the address of the same function taken there. This
      may break code that relies on comparing addresses passed from other
      components.
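
       The address-equality breakage can be pictured with a loose
       Python analogy, where each "module" takes the address of the
       same function through its own local trampoline. This is only
       an analogy for the non-canonical jump-table indirection, not
       how CFI is actually implemented:

```python
def shared_function():
    return 42

def make_jump_table_entry(fn):
    # stand-in for the per-module <function>.cfi_jt trampoline
    def entry():
        return fn()
    return entry

# Two "modules" each take the address of the same global function.
addr_in_mod_a = make_jump_table_entry(shared_function)
addr_in_mod_b = make_jump_table_entry(shared_function)

print(addr_in_mod_a() == addr_in_mod_b() == 42)  # calls reach the same code
print(addr_in_mod_a is addr_in_mod_b)            # but the "addresses" differ
```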
      
      CFI checking can be disabled in a function with the __nocfi attribute.
      Additionally, CFI can be disabled for an entire compilation unit by
      filtering out CC_FLAGS_CFI.
      
      By default, CFI failures result in a kernel panic to stop a potential
      exploit. CONFIG_CFI_PERMISSIVE enables a permissive mode, where the
      kernel prints out a rate-limited warning instead, and allows execution
      to continue. This option is helpful for locating type mismatches, but
      should only be enabled during development.
       Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
       Reviewed-by: Kees Cook <keescook@chromium.org>
       Tested-by: Nathan Chancellor <nathan@kernel.org>
       Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210408182843.1754385-2-samitolvanen@google.com
      cf68fffb
   21. 08 Apr 2021, 1 commit
     •
       stack: Optionally randomize kernel stack offset each syscall · 39218ff4
       Authored by Kees Cook
      This provides the ability for architectures to enable kernel stack base
      address offset randomization. This feature is controlled by the boot
      param "randomize_kstack_offset=on/off", with its default value set by
      CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
      
      This feature is based on the original idea from the last public release
      of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
      All the credit for the original idea goes to the PaX team. Note that
      the design and implementation of this upstream randomize_kstack_offset
      feature differs greatly from the RANDKSTACK feature (see below).
      
      Reasoning for the feature:
      
       This feature aims to make the various stack-based attacks that
       rely on a deterministic stack structure harder. We have had
       many such attacks in the past (to name just a few):
      
      https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
      https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
      https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
      
      As Linux kernel stack protections have been constantly improving
      (vmap-based stack allocation with guard pages, removal of thread_info,
      STACKLEAK), attackers have had to find new ways for their exploits
      to work. They have done so, continuing to rely on the kernel's stack
      determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
      were not relevant. For example, the following recent attacks would have
      been hampered if the stack offset was non-deterministic between syscalls:
      
      https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
      (page 70: targeting the pt_regs copy with linear stack overflow)
      
      https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
      (leaked stack address from one syscall as a target during next syscall)
      
      The main idea is that since the stack offset is randomized on each system
      call, it is harder for an attack to reliably land in any particular place
      on the thread stack, even with address exposures, as the stack base will
      change on the next syscall. Also, since randomization is performed after
      placing pt_regs, the ptrace-based approach[1] to discover the randomized
      offset during a long-running syscall should not be possible.
      
      Design description:
      
      During most of the kernel's execution, it runs on the "thread stack",
      which is pretty deterministic in its structure: it is fixed in size,
      and on every entry from userspace to kernel on a syscall the thread
      stack starts construction from an address fetched from the per-cpu
      cpu_current_top_of_stack variable. The first element to be pushed to the
      thread stack is the pt_regs struct that stores all required CPU registers
      and syscall parameters. Finally the specific syscall function is called,
      with the stack being used as the kernel executes the resulting request.
      
      The goal of randomize_kstack_offset feature is to add a random offset
      after the pt_regs has been pushed to the stack and before the rest of the
      thread stack is used during the syscall processing, and to change it every
      time a process issues a syscall. The source of randomness is currently
      architecture-defined (but x86 is using the low byte of rdtsc()). Future
       improvements for different entropy sources are possible, but out of scope
       for this patch. Furthermore, to add more unpredictability, new offsets
      are chosen at the end of syscalls (the timing of which should be less
      easy to measure from userspace than at syscall entry time), and stored
      in a per-CPU variable, so that the life of the value does not stay
      explicitly tied to a single task.
      
      As suggested by Andy Lutomirski, the offset is added using alloca()
      and an empty asm() statement with an output constraint, since it avoids
      changes to assembly syscall entry code, to the unwinder, and provides
      correct stack alignment as defined by the compiler.
      
      In order to make this available by default with zero performance impact
      for those that don't want it, it is boot-time selectable with static
      branches. This way, if the overhead is not wanted, it can just be
      left turned off with no performance impact.
      
      The generated assembly for x86_64 with GCC looks like this:
      
      ...
      ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
      					    # 12380 <kstack_offset>
      ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
      ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
      ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
      ffffffff8100398c: 48 29 c4              sub %rax,%rsp
      ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
      ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax
      ...
      
      As a result of the above stack alignment, this patch introduces about
      5 bits of randomness after pt_regs is spilled to the thread stack on
       x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for
      stack alignment). The amount of entropy could be adjusted based on how
      much of the stack space we wish to trade for security.
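
       The masking in the disassembly above can be replayed directly.
       This Python sketch mirrors only the and/add/and arithmetic
       (not the kernel macros), showing that the applied offset stays
       well under a page and is 8-byte granular:

```python
KSTACK_OFFSET_MASK = 0x3FF  # low 10 bits, as in the "and $0x3ff" above

def effective_offset(raw):
    # Mirror the generated code: mask the per-cpu value, then the
    # alloca bookkeeping: add $0xf, and $0x7f8.
    return ((raw & KSTACK_OFFSET_MASK) + 0xF) & 0x7F8

offsets = {effective_offset(raw) for raw in range(1 << 10)}
print(len(offsets), min(offsets), max(offsets))
```

       On this arithmetic alone there are 129 distinct 8-byte-granular
       offsets; the effective entropy after the compiler's final
       16-byte stack alignment is smaller, as noted above.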
      
      My measure of syscall performance overhead (on x86_64):
      
      lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
          randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
          randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds
      
      So, roughly 0.9% overhead growth for a no-op syscall, which is very
      manageable. And for people that don't want this, it's off by default.
      
      There are two gotchas with using the alloca() trick. First,
      compilers that have Stack Clash protection (-fstack-clash-protection)
      enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
      any dynamic stack allocations. While the randomization offset is
      always less than a page, the resulting assembly would still contain
      (unreachable!) probing routines, bloating the resulting assembly. To
      avoid this, -fno-stack-clash-protection is unconditionally added to
      the kernel Makefile since this is the only dynamic stack allocation in
      the kernel (now that VLAs have been removed) and it is provably safe
      from Stack Clash style attacks.
      
      The second gotcha with alloca() is a negative interaction with
      -fstack-protector*, in that it sees the alloca() as an array allocation,
      which triggers the unconditional addition of the stack canary function
      pre/post-amble which slows down syscalls regardless of the static
      branch. In order to avoid adding this unneeded check and its associated
      performance impact, architectures need to carefully remove uses of
      -fstack-protector-strong (or -fstack-protector) in the compilation units
      that use the add_random_kstack() macro and to audit the resulting stack
      mitigation coverage (to make sure no desired coverage disappears). No
      change is visible for this on x86 because the stack protector is already
      unconditionally disabled for the compilation unit, but the change is
      required on arm64. There is, unfortunately, no attribute that can be
      used to disable stack protector for specific functions.
      
      Comparison to PaX RANDKSTACK feature:
      
      The RANDKSTACK feature randomizes the location of the stack start
      (cpu_current_top_of_stack), i.e. including the location of pt_regs
      structure itself on the stack. Initially this patch followed the same
       approach, but during the recent discussions[2], it was determined
       to be of little value since, if ptrace functionality is available
       to an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
      different offsets in the pt_regs struct, observe the cache behavior of
      the pt_regs accesses, and figure out the random stack offset. Another
       difference is that the random offset is stored in a per-cpu
       variable rather than per-thread. As a result, the two
       implementations differ a fair bit in their details and results,
       though obviously the intent is similar.
      
      [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
      [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
       [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
       Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
       Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
       Signed-off-by: Kees Cook <keescook@chromium.org>
       Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
       Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
      39218ff4
   22. 05 Apr 2021, 1 commit
     •
       cgroup: Add misc cgroup controller · a72232ea
       Authored by Vipin Sharma
       The miscellaneous cgroup provides a resource limiting and
       tracking mechanism for scalar resources which cannot be
       abstracted like the other cgroup resources. The controller is
       enabled by the CONFIG_CGROUP_MISC config option.
      
      A resource can be added to the controller via enum misc_res_type{} in
      the include/linux/misc_cgroup.h file and the corresponding name via
      misc_res_name[] in the kernel/cgroup/misc.c file. Provider of the
      resource must set its capacity prior to using the resource by calling
      misc_cg_set_capacity().
      
      Once a capacity is set then the resource usage can be updated using
      charge and uncharge APIs. All of the APIs to interact with misc
      controller are in include/linux/misc_cgroup.h.
      
       The miscellaneous controller provides 3 interface files. If two
       misc resources (res_a and res_b) are registered, then:
      
      misc.capacity
      A read-only flat-keyed file shown only in the root cgroup.  It shows
      miscellaneous scalar resources available on the platform along with
      their quantities::
      
          $ cat misc.capacity
          res_a 50
          res_b 10
      
      misc.current
      A read-only flat-keyed file shown in the non-root cgroups.  It shows
      the current usage of the resources in the cgroup and its children::
      
          $ cat misc.current
          res_a 3
          res_b 0
      
      misc.max
       A read-write flat-keyed file shown in the non-root cgroups. It
       shows the allowed maximum usage of the resources in the cgroup
       and its children::
      
          $ cat misc.max
          res_a max
          res_b 4
      
      Limit can be set by::
      
          # echo res_a 1 > misc.max
      
      Limit can be set to max by::
      
          # echo res_a max > misc.max
      
       Limits can be set higher than the capacity value in the
       misc.capacity file.
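
       The flat-keyed files above are easy to consume from userspace.
       Here is a small Python sketch of a parser; the helper is
       illustrative, treating the special value "max" as "no limit":

```python
def parse_flat_keyed(text):
    """Parse a cgroup v2 flat-keyed file such as misc.current or
    misc.max into a dict; "max" is mapped to None (no limit)."""
    out = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        out[key] = None if value == "max" else int(value)
    return out

print(parse_flat_keyed("res_a max\nres_b 4"))
```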
       Signed-off-by: Vipin Sharma <vipinsh@google.com>
       Reviewed-by: David Rientjes <rientjes@google.com>
       Signed-off-by: Tejun Heo <tj@kernel.org>
      a72232ea
   23. 31 Mar 2021, 1 commit
   24. 19 Mar 2021, 1 commit
   25. 14 Mar 2021, 1 commit
     •
       init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM · ea29b20a
       Authored by Masahiro Yamada
      I read the commit log of the following two:
      
      - bc083a64 ("init/Kconfig: make COMPILE_TEST depend on !UML")
      - 334ef6ed ("init/Kconfig: make COMPILE_TEST depend on !S390")
      
      Both are talking about HAS_IOMEM dependency missing in many drivers.
      
       So, 'depends on HAS_IOMEM' seems like the direct, sensible solution to me.
      
      This does not change the behavior of UML. UML still cannot enable
      COMPILE_TEST because it does not provide HAS_IOMEM.
      
      The current dependency for S390 is too strong. Under the condition of
      CONFIG_PCI=y, S390 provides HAS_IOMEM, hence can enable COMPILE_TEST.
      
      I also removed the meaningless 'default n'.
      
       Link: https://lkml.kernel.org/r/20210224140809.1067582-1-masahiroy@kernel.org
       Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Arnd Bergmann <arnd@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KP Singh <kpsingh@google.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Terrell <terrelln@fb.com>
      Cc: Quentin Perret <qperret@google.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: "Enrico Weigelt, metux IT consult" <lkml@metux.net>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea29b20a
   26. 11 Mar 2021, 1 commit
   27. 28 Feb 2021, 1 commit