1. 21 Oct, 2021 1 commit
    • mm: add Kernel Electric-Fence infrastructure · e8d38c9d
      Alexander Potapenko committed
      mainline inclusion
      from mainline-v5.12-rc1
      commit 0ce20dd8
      category: feature
      bugzilla: 181005 https://gitee.com/openeuler/kernel/issues/I4EUY7
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0ce20dd840897b12ae70869c69f1ba34d6d16965
      
      -----------------------------------------------
      
      Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7.
      
      This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
      low-overhead sampling-based memory safety error detector of heap
      use-after-free, invalid-free, and out-of-bounds access errors.  This
      series enables KFENCE for the x86 and arm64 architectures, and adds
      KFENCE hooks to the SLAB and SLUB allocators.
      
      KFENCE is designed to be enabled in production kernels, and has near
      zero performance overhead. Compared to KASAN, KFENCE trades performance
      for precision. The main motivation behind KFENCE's design, is that with
      enough total uptime KFENCE will detect bugs in code paths not typically
      exercised by non-production test workloads. One way to quickly achieve a
      large enough total uptime is when the tool is deployed across a large
      fleet of machines.
      
      KFENCE objects each reside on a dedicated page, at either the left or
      right page boundaries. The pages to the left and right of the object
      page are "guard pages", whose attributes are changed to a protected
      state, and cause page faults on any attempted access to them. Such page
      faults are then intercepted by KFENCE, which handles the fault
      gracefully by reporting a memory access error.
      
      Guarded allocations are set up based on a sample interval (can be set
      via kfence.sample_interval). After expiration of the sample interval,
      the next allocation through the main allocator (SLAB or SLUB) returns a
      guarded allocation from the KFENCE object pool. At this point, the timer
      is reset, and the next allocation is set up after the expiration of the
      interval.
      
      To enable/disable a KFENCE allocation through the main allocator's
      fast-path without overhead, KFENCE relies on static branches via the
      static keys infrastructure. The static branch is toggled to redirect the
      allocation to KFENCE.
      
      The KFENCE memory pool is of fixed size, and if the pool is exhausted no
      further KFENCE allocations occur. The default config is conservative
      with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB
      pages).
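The pool-size figure above can be checked with a little arithmetic. The sketch below assumes each object costs two pages (one data page plus one guard page) and that the pool carries two extra pages at its start; the helper name `kfence_pool_bytes` is hypothetical and only serves the illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumed layout: one data page and one guard page per object, plus two
 * leading pages, i.e. (num_objects + 1) * 2 pages in total.
 */
static size_t kfence_pool_bytes(size_t num_objects, size_t page_size)
{
	assert(page_size != 0);
	return (num_objects + 1) * 2 * page_size;
}
```

Under that assumption, 255 objects with 4 KiB pages give (255 + 1) * 2 * 4096 = 2097152 bytes, which is the 2 MiB quoted in the text.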
      
      We have verified by running synthetic benchmarks (sysbench I/O,
      hackbench) and production server-workload benchmarks that a kernel with
      KFENCE (using sample intervals 100-500ms) is performance-neutral
      compared to a non-KFENCE baseline kernel.
      
      KFENCE is inspired by GWP-ASan [1], a userspace tool with similar
      properties. The name "KFENCE" is a homage to the Electric Fence Malloc
      Debugger [2].
      
      For more details, see Documentation/dev-tools/kfence.rst added in the
      series -- also viewable here:
      
      	https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst
      
      [1] http://llvm.org/docs/GwpAsan.html
      [2] https://linux.die.net/man/3/efence
      
      This patch (of 9):
      
      This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
      low-overhead sampling-based memory safety error detector of heap
      use-after-free, invalid-free, and out-of-bounds access errors.
      
      KFENCE is designed to be enabled in production kernels, and has near
      zero performance overhead. Compared to KASAN, KFENCE trades performance
      for precision. The main motivation behind KFENCE's design, is that with
      enough total uptime KFENCE will detect bugs in code paths not typically
      exercised by non-production test workloads. One way to quickly achieve a
      large enough total uptime is when the tool is deployed across a large
      fleet of machines.
      
      KFENCE objects each reside on a dedicated page, at either the left or
      right page boundaries. The pages to the left and right of the object
      page are "guard pages", whose attributes are changed to a protected
      state, and cause page faults on any attempted access to them. Such page
      faults are then intercepted by KFENCE, which handles the fault
      gracefully by reporting a memory access error. To detect out-of-bounds
      writes to memory within the object's page itself, KFENCE also uses
      pattern-based redzones. The following figure illustrates the page
      layout:
      
        ---+-----------+-----------+-----------+-----------+-----------+---
           | xxxxxxxxx | O :       | xxxxxxxxx |       : O | xxxxxxxxx |
           | xxxxxxxxx | B :       | xxxxxxxxx |       : B | xxxxxxxxx |
           | x GUARD x | J : RED-  | x GUARD x | RED-  : J | x GUARD x |
           | xxxxxxxxx | E :  ZONE | xxxxxxxxx |  ZONE : E | xxxxxxxxx |
           | xxxxxxxxx | C :       | xxxxxxxxx |       : C | xxxxxxxxx |
           | xxxxxxxxx | T :       | xxxxxxxxx |       : T | xxxxxxxxx |
        ---+-----------+-----------+-----------+-----------+-----------+---
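The left/right placement in the figure can be sketched as a small address computation. This is an illustrative userspace model, not the kernel code: `PAGE_SZ` and `place_object` are hypothetical names, and the 4 KiB page size is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096u /* assumed page size */

/*
 * Returns the object's start address inside its dedicated page.
 * place_right == 0: the object starts at the left page boundary.
 * place_right != 0: the object ends as close to the right boundary as the
 * alignment allows; any gap left over is what the redzone has to cover.
 */
static uintptr_t place_object(uintptr_t page_base, uint32_t size,
			      uint32_t align, int place_right)
{
	assert(size > 0 && size <= PAGE_SZ);
	assert(align > 0 && (align & (align - 1)) == 0); /* power of two */
	assert(page_base % PAGE_SZ == 0);

	if (!place_right)
		return page_base;

	uintptr_t addr = page_base + PAGE_SZ - size;
	return addr & ~((uintptr_t)align - 1); /* round down to the alignment */
}
```

With a 100-byte object and 8-byte alignment, right placement leaves a 4-byte gap before the guard page. An overflow into that gap does not fault, which is why the pattern-based redzone is needed in addition to the guard pages.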
      
      Guarded allocations are set up based on a sample interval (can be set
      via kfence.sample_interval). After expiration of the sample interval, a
      guarded allocation from the KFENCE object pool is returned to the main
      allocator (SLAB or SLUB). At this point, the timer is reset, and the
      next allocation is set up after the expiration of the interval.
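The sampling behaviour described above can be modelled as a gate that the timer opens and the next allocation closes. This is a simplified single-threaded sketch; `struct sampler` and both functions are hypothetical names, and the real implementation uses a static branch rather than a plain flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sampler {
	bool gate_open;  /* set once the sample interval has expired */
	size_t guarded;  /* allocations served from the guarded pool */
	size_t regular;  /* allocations served by the main allocator */
};

/* Called when the sample interval expires. */
static void sampler_interval_expired(struct sampler *s)
{
	assert(s != NULL);
	s->gate_open = true;
}

/* Returns true if this allocation is served from the guarded pool. */
static bool sampler_alloc(struct sampler *s)
{
	assert(s != NULL);
	if (s->gate_open) {
		s->gate_open = false; /* wait for the next interval */
		s->guarded++;
		return true;
	}
	s->regular++;
	return false;
}
```

Only the first allocation after each expiry is redirected, so the number of guarded allocations is bounded by elapsed time divided by the sample interval, regardless of allocation rate.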
      
      To enable/disable a KFENCE allocation through the main allocator's
      fast-path without overhead, KFENCE relies on static branches via the
      static keys infrastructure. The static branch is toggled to redirect the
      allocation to KFENCE. To date, we have verified by running synthetic
      benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE
      is performance-neutral compared to the non-KFENCE baseline.
      
      For more details, see Documentation/dev-tools/kfence.rst (added later in
      the series).
      
      [elver@google.com: fix parameter description for kfence_object_start()]
        Link: https://lkml.kernel.org/r/20201106092149.GA2851373@elver.google.com
      [elver@google.com: avoid stalling work queue task without allocations]
        Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com
        Link: https://lkml.kernel.org/r/20201110135320.3309507-1-elver@google.com
      [elver@google.com: fix potential deadlock due to wake_up()]
        Link: https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com
        Link: https://lkml.kernel.org/r/20210104130749.1768991-1-elver@google.com
      [elver@google.com: add option to use KFENCE without static keys]
        Link: https://lkml.kernel.org/r/20210111091544.3287013-1-elver@google.com
      [elver@google.com: add missing copyright and description headers]
        Link: https://lkml.kernel.org/r/20210118092159.145934-1-elver@google.com
      
      Link: https://lkml.kernel.org/r/20201103175841.3495947-2-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: SeongJae Park <sjpark@amazon.de>
      Co-developed-by: Marco Elver <elver@google.com>
      Reviewed-by: Jann Horn <jannh@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	init/main.c
      [Peng Liu: cherry-pick from 0ce20dd8]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Yingjie Shang <1415317271@qq.com>
      Reviewed-by: Bixuan Cui <cuibixuan@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  2. 13 Oct, 2021 1 commit
  3. 12 Oct, 2021 1 commit
    • init: only move down lockup_detector_init() when sdei_watchdog is enabled · 60565144
      Xiongfeng Wang committed
      hulk inclusion
      category: bugfix
      bugzilla: 173968 https://gitee.com/openeuler/kernel/issues/I4DDEL
      
      -------------------------------------------------
      
      When I enabled CONFIG_DEBUG_PREEMPT and CONFIG_PREEMPT on x86, I got the
      following call trace:
      
      [    3.341853] BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
      [    3.344392] caller is debug_smp_processor_id+0x17/0x20
      [    3.344395] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.10.0+ #398
      [    3.344397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
      [    3.344399] Call Trace:
      [    3.344410]  dump_stack+0x60/0x76
      [    3.344412]  check_preemption_disabled+0xba/0xc0
      [    3.344415]  debug_smp_processor_id+0x17/0x20
      [    3.344422]  hardlockup_detector_event_create+0xf/0x60
      [    3.344427]  hardlockup_detector_perf_init+0xf/0x41
      [    3.344430]  watchdog_nmi_probe+0xe/0x10
      [    3.344432]  lockup_detector_init+0x22/0x5b
      [    3.344437]  kernel_init_freeable+0x20c/0x245
      [    3.344439]  ? rest_init+0xd0/0xd0
      [    3.344441]  kernel_init+0xe/0x110
      [    3.344446]  ret_from_fork+0x22/0x30
      
      This happens because sched_init_smp() sets 'current->nr_cpus_allowed' to
      the number of possible CPUs, so check_preemption_disabled() fails. The
      issue was introduced by commit a7905043, which moved
      lockup_detector_init() down to after do_basic_setup(). Fix it by moving
      lockup_detector_init() back to its original place when sdei_watchdog is
      disabled. There is no problem when sdei_watchdog is enabled because
      watchdog_nmi_probe() is overridden in
      'arch/arm64/kernel/watchdog_sdei.c' in this case.
      
      Fixes: a7905043 ("lockup_detector: init lockup detector after all the init_calls")
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  4. 28 Sep, 2021 1 commit
  5. 14 Jul, 2021 1 commit
  6. 15 Jun, 2021 1 commit
    • pid: take a reference when initializing `cad_pid` · 11774e84
      Mark Rutland committed
      stable inclusion
      from stable-5.10.43
      commit 7178be006d495ffb741c329012da289b62dddfe6
      bugzilla: 109284
      CVE: NA
      
      --------------------------------
      
      commit 0711f0d7 upstream.
      
      During boot, kernel_init_freeable() initializes `cad_pid` to the init
      task's struct pid.  Later on, we may change `cad_pid` via a sysctl, and
      when this happens proc_do_cad_pid() will increment the refcount on the
      new pid via get_pid(), and will decrement the refcount on the old pid
      via put_pid().  As we never called get_pid() when we initialized
      `cad_pid`, we decrement a reference we never incremented, and can therefore
      free the init task's struct pid early.  As there can be dangling
      references to the struct pid, we can later encounter a use-after-free
      (e.g.  when delivering signals).
      
      This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to
      have been around since the conversion of `cad_pid` to struct pid in
      commit 9ec52099 ("[PATCH] replace cad_pid by a struct pid") from the
      pre-KASAN stone age of v2.6.19.
      
      Fix this by getting a reference to the init task's struct pid when we
      assign it to `cad_pid`.
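The imbalance can be reproduced with a toy reference count. This is a userspace model under stated assumptions: `struct toy_pid`, `toy_get`, `toy_put` and `toy_set_cad_pid` are hypothetical stand-ins for struct pid, get_pid(), put_pid() and the sysctl handler, and "freed" is just a flag.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pid {
	int refcount;
	bool freed;
};

static struct toy_pid *toy_get(struct toy_pid *p)
{
	p->refcount++;
	return p;
}

static void toy_put(struct toy_pid *p)
{
	assert(!p->freed);
	if (--p->refcount == 0)
		p->freed = true; /* stands in for freeing the object */
}

/* Models the sysctl write: take a reference on the new pid, drop the old. */
static void toy_set_cad_pid(struct toy_pid **cad, struct toy_pid *new_pid)
{
	struct toy_pid *old = *cad;

	*cad = toy_get(new_pid);
	toy_put(old);
}
```

If the global is initialized with a plain assignment, the first sysctl write drops the only reference (the one owned by the init task) and the object is marked freed while still in use. Initializing it through `toy_get()` leaves the owner's reference intact after the swap.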
      
      Full KASAN splat below.
      
         ==================================================================
         BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline]
         BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
         Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273
      
         CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1
         Hardware name: linux,dummy-virt (DT)
         Call trace:
          ns_of_pid include/linux/pid.h:153 [inline]
          task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
          do_notify_parent+0x308/0xe60 kernel/signal.c:1950
          exit_notify kernel/exit.c:682 [inline]
          do_exit+0x2334/0x2bd0 kernel/exit.c:845
          do_group_exit+0x108/0x2c8 kernel/exit.c:922
          get_signal+0x4e4/0x2a88 kernel/signal.c:2781
          do_signal arch/arm64/kernel/signal.c:882 [inline]
          do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936
          work_pending+0xc/0x2dc
      
         Allocated by task 0:
          slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516
          slab_alloc_node mm/slub.c:2907 [inline]
          slab_alloc mm/slub.c:2915 [inline]
          kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920
          alloc_pid+0xdc/0xc00 kernel/pid.c:180
          copy_process+0x2794/0x5e18 kernel/fork.c:2129
          kernel_clone+0x194/0x13c8 kernel/fork.c:2500
          kernel_thread+0xd4/0x110 kernel/fork.c:2552
          rest_init+0x44/0x4a0 init/main.c:687
          arch_call_rest_init+0x1c/0x28
          start_kernel+0x520/0x554 init/main.c:1064
          0x0
      
         Freed by task 270:
          slab_free_hook mm/slub.c:1562 [inline]
          slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600
          slab_free mm/slub.c:3161 [inline]
          kmem_cache_free+0x224/0x8e0 mm/slub.c:3177
          put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114
          put_pid+0x30/0x48 kernel/pid.c:109
          proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401
          proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591
          proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617
          call_write_iter include/linux/fs.h:1977 [inline]
          new_sync_write+0x3ac/0x510 fs/read_write.c:518
          vfs_write fs/read_write.c:605 [inline]
          vfs_write+0x9c4/0x1018 fs/read_write.c:585
          ksys_write+0x124/0x240 fs/read_write.c:658
          __do_sys_write fs/read_write.c:670 [inline]
          __se_sys_write fs/read_write.c:667 [inline]
          __arm64_sys_write+0x78/0xb0 fs/read_write.c:667
          __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
          invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
          el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129
          do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168
          el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416
          el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432
          el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701
      
         The buggy address belongs to the object at ffff23794dda0000
          which belongs to the cache pid of size 224
         The buggy address is located 4 bytes inside of
          224-byte region [ffff23794dda0000, ffff23794dda00e0)
         The buggy address belongs to the page:
         page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0
         head:(____ptrval____) order:1 compound_mapcount:0
         flags: 0x3fffc0000010200(slab|head)
         raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080
         raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
         page dumped because: kasan: bad access detected
      
         Memory state around the buggy address:
          ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
          ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
         >ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
          ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
          ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
         ==================================================================
      
      Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com
      Fixes: 9ec52099 ("[PATCH] replace cad_pid by a struct pid")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  7. 09 Apr, 2021 1 commit
  8. 08 Apr, 2021 1 commit
  9. 28 Jan, 2021 1 commit
  10. 07 Jan, 2021 1 commit
  11. 01 Dec, 2020 1 commit
  12. 13 Nov, 2020 1 commit
  13. 10 Oct, 2020 1 commit
  14. 19 Sep, 2020 2 commits
  15. 01 Sep, 2020 1 commit
  16. 08 Aug, 2020 1 commit
  17. 05 Aug, 2020 2 commits
  18. 31 Jul, 2020 4 commits
  19. 21 Jul, 2020 1 commit
  20. 16 Jun, 2020 1 commit
    • security: allow using Clang's zero initialization for stack variables · f0fe00d4
      glider@google.com committed
      In addition to -ftrivial-auto-var-init=pattern (used by
      CONFIG_INIT_STACK_ALL now) Clang also supports zero initialization for
      locals enabled by -ftrivial-auto-var-init=zero. The future of this flag
      is still being debated (see https://bugs.llvm.org/show_bug.cgi?id=45497).
      Right now it is guarded by another flag,
      -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang,
      which means it may not be supported by future Clang releases. Another
      possible resolution is that -ftrivial-auto-var-init=zero will persist
      (as certain users have already started depending on it), but the name
      of the guard flag will change.
      
      In the meantime, zero initialization has proven itself as a good
      production mitigation measure against uninitialized locals. Unlike pattern
      initialization, which has a higher chance of triggering existing bugs,
      zero initialization provides safe defaults for strings, pointers, indexes,
      and sizes. On the other hand, pattern initialization remains safer for
      return values. Chrome OS and Android are moving to using zero
      initialization for production builds.
      
      Performance-wise, the difference between pattern and zero initialization
      is usually negligible, although the generated code for zero
      initialization is more compact.
      
      This patch renames CONFIG_INIT_STACK_ALL to CONFIG_INIT_STACK_ALL_PATTERN
      and introduces another config option, CONFIG_INIT_STACK_ALL_ZERO, that
      enables zero initialization for locals if the corresponding flags are
      supported by Clang.
      
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Link: https://lore.kernel.org/r/20200616083435.223038-1-glider@google.com
      Reviewed-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
  21. 09 Jun, 2020 1 commit
    • kernel/sysctl: support setting sysctl parameters from kernel command line · 3db978d4
      Vlastimil Babka committed
      Patch series "support setting sysctl parameters from kernel command line", v3.
      
      This series adds support for something that seems like many people
      always wanted but nobody added it yet, so here's the ability to set
      sysctl parameters via kernel command line options in the form of
      sysctl.vm.something=1
      
      The important part is Patch 1.  The second, not so important part is an
      attempt to clean up legacy one-off parameters that do the same thing as
      a sysctl.  I don't want to remove them completely for compatibility
      reasons, but with generic sysctl support the idea is to remove the
      one-off param handlers and treat the parameters as aliases for the
      sysctl variants.
      
      I have identified several parameters that mention sysctl counterparts in
      Documentation/admin-guide/kernel-parameters.txt but there might be more.
      The conversion also has varying level of success:
      
       - numa_zonelist_order is converted in Patch 2 together with adding the
         necessary infrastructure. It's easy as it doesn't really do anything
         but warn on deprecated value these days.
      
       - hung_task_panic is converted in Patch 3, but there's a downside that
         now it only accepts 0 and 1, while previously it was any integer
         value
      
       - nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
         so there's no straightforward conversion possible
      
       - traceoff_on_warning is a flag without value and it would be required
         to handle that somehow in the conversion infrastructure, which seems
         pointless for a single flag
      
      This patch (of 5):
      
      A recently proposed patch to add vm_swappiness command line parameter in
      addition to existing sysctl [1] made me wonder why we don't have a
      general support for passing sysctl parameters via command line.
      
      Googling found only somebody else wondering the same [2], but I haven't
      found any prior discussion with reasons why not to do this.
      
      Setting the vm_swappiness issue aside (the underlying issue might be
      solved in a different way), quick search of kernel-parameters.txt shows
      there are already some that exist as both sysctl and kernel parameter -
      hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.
      
      A general mechanism would remove the need to add more of those one-offs
      and might be handy in situations where configuration by e.g.
      /etc/sysctl.d/ is impractical.
      
      Hence, this patch adds a new parse_args() pass that looks for parameters
      prefixed by 'sysctl.' and tries to interpret them as writes to the
      corresponding /proc/sys/ files using a temporary in-kernel procfs mount.
      This mechanism was suggested by Eric W.  Biederman [3], as it handles
      all dynamically registered sysctl tables, even though we don't handle
      modular sysctls.  Errors due to e.g.  invalid parameter name or value
      are reported in the kernel log.
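The name-to-file mapping described above can be sketched in userspace as follows. This is a simplified illustration, not the kernel's implementation: `sysctl_param_to_path` is a hypothetical helper, it handles only the parameter name (the part before `=`), and it treats every dot as a path separator.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Convert "sysctl.vm.swappiness" into "/proc/sys/vm/swappiness".
 * Returns 0 on success, -1 if the prefix is missing or out is too small.
 */
static int sysctl_param_to_path(const char *param, char *out, size_t out_len)
{
	static const char prefix[] = "sysctl.";
	const size_t plen = sizeof(prefix) - 1;

	assert(param != NULL && out != NULL);

	if (strncmp(param, prefix, plen) != 0 || param[plen] == '\0')
		return -1;

	int n = snprintf(out, out_len, "/proc/sys/%s", param + plen);
	if (n < 0 || (size_t)n >= out_len)
		return -1; /* output was truncated */

	/* Dots in the remainder separate path components. */
	for (char *p = out + strlen("/proc/sys/"); *p; p++)
		if (*p == '.')
			*p = '/';
	return 0;
}
```

Going through the file interface rather than a lookup table is what lets the mechanism cover every registered sysctl without a per-parameter handler.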
      
      The processing is hooked right before the init process is loaded, as
      some handlers might be more complicated than simple setters and might
      need some subsystems to be initialized.  If, at that point, the init
      process can be started and eventually execute a process that writes to
      /proc/sys/, then it should also be fine to do that from the kernel.
      
      Sysctls registered later on module load time are not set by this
      mechanism - it's expected that in such scenarios, setting sysctl values
      from userspace is practical enough.
      
      [1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
      [2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
      [3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
      Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 05 Jun, 2020 1 commit
    • init: allow distribution configuration of default init · ada4ab7a
      Chris Down committed
      Some init systems (e.g. systemd) have init at their own paths, for
      example, /usr/lib/systemd/systemd.  A compatibility symlink to one of the
      hardcoded init paths is provided by another package, usually named
      something like systemd-sysvcompat or similar.
      
      Currently distro maintainers who are hands-off on the bootloader are more
      or less required to include those compatibility links as part of their
      base distribution, because it's hard to migrate away from them since
      there's a risk some users will not get the message to set init= on the
      kernel command line appropriately.
      
      Moreover, for distributions where the init system is something the
      distribution itself is opinionated about (e.g. Arch, which has systemd in
      the required `base` package), we could usually reasonably configure this
      ahead of time when building the distribution kernel.  However, we
      currently simply don't have any way to configure the kernel to do this.
      Here's an example discussion where removing sysvcompat was discussed by
      distro maintainers[0].
      
      This patch adds a new Kconfig tunable, CONFIG_DEFAULT_INIT, which if set
      is tried before the hardcoded fallback list.  So the order of precedence
      is now thus:
      
      1. init= on command line (on failure: panic)
      2. CONFIG_DEFAULT_INIT (on failure: try #3)
      3. Hardcoded fallback list (on failure: panic)
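The order of precedence in the list above can be written out as a small selection routine. This is a userspace sketch under stated assumptions: `pick_init`, `try_exec_fn`, `installed_path` and `fake_try_exec` are hypothetical names, a NULL return stands in for a kernel panic, and the fallback list is supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callback that tries to start an init binary; returns 0 on success. */
typedef int (*try_exec_fn)(const char *path);

/*
 * Returns the path that was started, or NULL where the kernel would panic.
 * cmdline_init: value of init= (NULL if absent)
 * default_init: value of CONFIG_DEFAULT_INIT ("" if unset)
 */
static const char *pick_init(const char *cmdline_init, const char *default_init,
			     const char *const *fallbacks, size_t nr_fallbacks,
			     try_exec_fn try_exec)
{
	assert(try_exec != NULL);

	/* 1. init= on the command line: no fallback on failure. */
	if (cmdline_init)
		return try_exec(cmdline_init) == 0 ? cmdline_init : NULL;

	/* 2. CONFIG_DEFAULT_INIT: on failure, fall through to step 3. */
	if (default_init && default_init[0] != '\0' &&
	    try_exec(default_init) == 0)
		return default_init;

	/* 3. Hardcoded fallback list, tried in order. */
	for (size_t i = 0; i < nr_fallbacks; i++)
		if (try_exec(fallbacks[i]) == 0)
			return fallbacks[i];

	return NULL;
}

/* Toy system for illustration: exactly one init binary is "installed". */
static const char *installed_path;

static int fake_try_exec(const char *path)
{
	return installed_path && strcmp(path, installed_path) == 0 ? 0 : -1;
}
```

The asymmetry is deliberate: an explicit init= is treated as a hard requirement, while the configured default degrades to the fallback list when its binary is missing.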
      
      This new config parameter will allow distribution maintainers to move away
      from these compatibility links safely, without having to worry that their
      users might not have the right init=.
      
      There are also two other benefits of this over having the distribution
      maintain a symlink:
      
      1. One of the value propositions over simply having distributions
         maintain a /sbin/init symlink via a package is that it also frees
         distributions which have a preferred default, but not mandatory, init
         system from having their package manager fight with their users for
         control of /{s,}bin/init.  Instead, the distribution simply makes
         their preference known in CONFIG_DEFAULT_INIT, and if the user
         installs another init system and uninstalls the default one they can
         still make use of /{s,}bin/init and friends for their own uses. This
         makes more cases Just Work(tm) without the user having to perform
         extra configuration via init=.
      
      2. Since before this we don't know which path the distribution actually
         _intends_ to serve init from, we don't pr_err if it is simply
         missing, and usually will just silently put the user in a /bin/sh
         shell. Now that the distribution can make a declaration of intent, we
         can be more vocal when this init system fails to launch for any
         reason, even if it's simply because no file exists at that location,
         speeding up the palaver of init/mount dependency/etc debugging a bit.
      
      [0]: https://lists.archlinux.org/pipermail/arch-dev-public/2019-January/029435.html
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: http://lkml.kernel.org/r/20200522160234.GA1487022@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 04 Jun, 2020 1 commit
    • padata: initialize earlier · f1b192b1
      Daniel Jordan committed
      padata will soon initialize the system's struct pages in parallel, so it
      needs to be ready by page_alloc_init_late().
      
      The error return from padata_driver_init() triggers an initcall warning,
      so add a warning to padata_init() to avoid silent failure.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-3-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 15 May, 2020 1 commit
    • x86: Fix early boot crash on gcc-10, third try · a9a3ed1e
      Borislav Petkov committed
      ... or the odyssey of trying to disable the stack protector for the
      function which generates the stack canary value.
      
      The whole story started with Sergei reporting a boot crash with a kernel
      built with gcc-10:
      
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5-00235-gfffb08b3 #139
        Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M-D3H, BIOS F12 11/14/2013
        Call Trace:
          dump_stack
          panic
          ? start_secondary
          __stack_chk_fail
          start_secondary
          secondary_startup_64
        ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
      
      This happens because gcc-10 tail-call optimizes the last function call
      in start_secondary() - cpu_startup_entry() - and thus emits a stack
      canary check which fails because the canary value changes after the
      boot_init_stack_canary() call.
      
      To fix that, the initial attempt was to mark the one function which
      generates the stack canary with:
      
        __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)
      
      however, using the optimize attribute doesn't work cumulatively
      as the attribute does not add to but rather replaces previously
      supplied optimization options - roughly all -fxxx options.
      
      The key one among them is -fno-omit-frame-pointer: once it is replaced,
      no frame pointer is present, and the kernel needs the frame pointer.
      
      The next attempt to prevent compilers from tail-call optimizing
      the last function call cpu_startup_entry(), shy of carving out
      start_secondary() into a separate compilation unit and building it with
      -fno-stack-protector, was to add an empty asm("").
      
      This current solution was short and sweet, and reportedly, is supported
      by both compilers but we didn't get very far this time: future (LTO?)
      optimization passes could potentially eliminate this, which leads us
      to the third attempt: having an actual memory barrier there which the
      compiler cannot ignore or move around etc.
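The third attempt can be illustrated in userspace with a standard C11 fence placed after the final call. This is a sketch of the idea only: `last_call` and `caller` are hypothetical stand-ins for cpu_startup_entry() and start_secondary(), and `atomic_thread_fence()` is used here in place of whatever barrier the kernel patch actually employs.

```c
#include <assert.h>
#include <stdatomic.h>

static int calls;

/* noinline (a GCC/Clang attribute) keeps this a real call in the sketch. */
static __attribute__((noinline)) void last_call(void)
{
	calls++;
}

static void caller(void)
{
	last_call();
	/*
	 * The fence has to execute after last_call() returns, so the call is
	 * no longer in tail position and cannot be emitted as a plain jump.
	 */
	atomic_thread_fence(memory_order_seq_cst);
}
```

Unlike an empty asm statement, a full fence has an observable ordering effect, so an optimization pass has no grounds to drop it.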
      
      That should hold for a long time, but hey we said that about the other
      two solutions too so...
      Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Tested-by: Kalle Valo <kvalo@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
  25. 12 May, 2020 1 commit
  26. 06 May, 2020 1 commit
  27. 11 Apr, 2020 1 commit
  28. 04 Mar, 2020 1 commit
  29. 21 Feb, 2020 4 commits
  30. 11 Feb, 2020 1 commit
  31. 06 Feb, 2020 2 commits