1. 24 March 2022, 2 commits
  2. 22 January 2022, 1 commit
    • lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() · 2dba5eb1
      Authored by Vlastimil Babka
      Currently, enabling CONFIG_STACKDEPOT means its stack_table will be
      allocated from memblock, even if stack depot ends up not actually used.
      The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit.
      
      This is fine for use-cases such as KASAN which is also a config option
      and has overhead on its own.  But it's an issue for functionality that
      has to be actually enabled on boot (page_owner) or depends on hardware
      (GPU drivers) and thus the memory might be wasted.  This was raised as
      an issue [1] when attempting to add stackdepot support for SLUB's debug
      object tracking functionality.  It's common to build kernels with
      CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or
      create only specific kmem caches with debugging for testing purposes.
      
      It would thus be more efficient if stackdepot's table was allocated only
      when actually going to be used.  This patch thus makes the allocation
      (and whole stack_depot_init() call) optional:
      
       - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current
         well-defined point of allocation as part of mem_init(). Make
         CONFIG_KASAN select this flag.
      
       - Other users have to call stack_depot_init() as part of their own init
         when it's determined that stack depot will actually be used. This may
         depend on both config and runtime conditions. Convert current users
         which are page_owner and several in the DRM subsystem. Same will be
         done for SLUB later.
      
       - Because the init might now be called after the boot-time memblock
         allocation has given all memory to the buddy allocator, change
         stack_depot_init() to allocate stack_table with kvmalloc() when
         memblock is no longer available. Also handle allocation failure by
         disabling stackdepot (could have theoretically happened even with
         memblock allocation previously), and don't unnecessarily align the
   memblock allocation to its own size anymore (see the sketch below).
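
      A minimal sketch of the resulting stack_depot_init() logic, tying the
      bullets above together (simplified from lib/stackdepot.c: locking and
      zeroing of the kvmalloc'ed table are omitted, and identifier details are
      illustrative rather than the exact upstream code):

        int stack_depot_init(void)
        {
                size_t size = STACK_HASH_SIZE * sizeof(struct stack_record *);

                if (stack_table)
                        return 0;               /* already initialized */

                if (slab_is_available())        /* boot gave memory to buddy already */
                        stack_table = kvmalloc(size, GFP_KERNEL);
                else                            /* CONFIG_STACKDEPOT_ALWAYS_INIT path */
                        stack_table = memblock_alloc(size, SMP_CACHE_BYTES);

                if (!stack_table) {
                        pr_err("Stack Depot hash table allocation failed, disabling\n");
                        stack_depot_disable = true;     /* module-level flag */
                        return -ENOMEM;
                }
                return 0;
        }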
      
      [1] https://lore.kernel.org/all/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20211013073005.11351-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Marco Elver <elver@google.com> # stackdepot
      Cc: Marco Elver <elver@google.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Oliver Glitta <glittao@gmail.com>
      Cc: Imran Khan <imran.f.khan@oracle.com>
      From: Colin Ian King <colin.king@canonical.com>
      Subject: lib/stackdepot: fix spelling mistake and grammar in pr_err message
      
      There is a spelling mistake in the word "allocation", so fix this and
      re-phrase the message to make it easier to read.
      
      Link: https://lkml.kernel.org/r/20211015104159.11282-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      From: Vlastimil Babka <vbabka@suse.cz>
      Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup
      
      On FLATMEM, we call page_ext_init_flatmem_late() just before
      kmem_cache_init(), which means stack_depot_init() (called by the page
      owner init) will not properly recognize that it should use kvmalloc()
      and not memblock_alloc().  memblock_alloc() will not issue a warning
      either, and will return a block of memory that can be invalid and cause
      a kernel page fault when saving stacks, as reported by the kernel test
      robot [1].
      
      Fix this by moving page_ext_init_flatmem_late() below kmem_cache_init() so
      that slab_is_available() is true during stack_depot_init().  SPARSEMEM
      doesn't have this issue, as it doesn't do page_ext_init_flatmem_late(),
      but a different page_ext_init() even later in the boot process.
      
      Thanks to Mike Rapoport for pointing out the FLATMEM init ordering issue.
      
      While at it, also actually resolve a checkpatch warning in stack_depot_init()
      from DRM CI, which was supposed to be in the original patch already.
      
      [1] https://lore.kernel.org/all/20211014085450.GC18719@xsang-OptiPlex-9020/
      
      Link: https://lkml.kernel.org/r/6abd9213-19a9-6d58-cedc-2414386d2d81@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      From: Vlastimil Babka <vbabka@suse.cz>
      Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup3
      
      Due to cd06ab2f ("drm/locking: add backtrace for locking contended
      locks without backoff") landing recently to -next adding a new stack depot
      user in drivers/gpu/drm/drm_modeset_lock.c we need to add an appropriate
      call to stack_depot_init() there as well.
      
      Link: https://lkml.kernel.org/r/2a692365-cfa1-64f2-34e0-8aa5674dce5e@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Oliver Glitta <glittao@gmail.com>
      Cc: Imran Khan <imran.f.khan@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      From: Vlastimil Babka <vbabka@suse.cz>
      Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup4
      
      Due to 4e66934e ("lib: add reference counting tracking
      infrastructure") landing recently to net-next adding a new stack depot
      user in lib/ref_tracker.c we need to add an appropriate call to
      stack_depot_init() there as well.
      
      Link: https://lkml.kernel.org/r/45c1b738-1a2f-5b5f-2f6d-86fab206d01c@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Slab <jirislaby@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 10 November 2021, 1 commit
  4. 07 November 2021, 2 commits
  5. 18 October 2021, 1 commit
  6. 11 October 2021, 4 commits
  7. 23 September 2021, 1 commit
  8. 15 September 2021, 1 commit
    • memblock: introduce saner 'memblock_free_ptr()' interface · 77e02cf5
      Authored by Linus Torvalds
      The boot-time allocation interface for memblock is a mess, with
      'memblock_alloc()' returning a virtual pointer, but then you are
      supposed to free it with 'memblock_free()' that takes a _physical_
      address.
      
      Not only is that all kinds of strange and illogical, but it actually
      causes bugs, when people then use it like a normal allocation function,
      and it fails spectacularly on a NULL pointer:
      
         https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/
      
      or just random memory corruption if the debug checks don't catch it:
      
         https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/
      
      I really don't want to apply patches that treat the symptoms, when the
      fundamental cause is this horribly confusing interface.
      
      I started out looking at just automating a sane replacement sequence,
      but because of this mix of virtual and physical addresses, and because
      people have used the "__pa()" macro that can take either a regular
      kernel pointer, or just the raw "unsigned long" address, it's all quite
      messy.
      
      So this just introduces a new saner interface for freeing a virtual
      address that was allocated using 'memblock_alloc()', and that was kept
      as a regular kernel pointer.  And then it converts a couple of users
      that are obvious and easy to test, including the 'xbc_nodes' case in
      lib/bootconfig.c that caused problems.
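
      A hedged sketch of the old and new pairings for a hypothetical boot-time
      caller (memblock_free_ptr() is the helper introduced here; the
      surrounding function is made up for illustration):

        #include <linux/memblock.h>

        static void __init example_boot_buffer(void)
        {
                size_t size = PAGE_SIZE;
                void *buf = memblock_alloc(size, SMP_CACHE_BYTES);

                if (!buf)
                        return;

                /* old interface: caller converts back to a physical address */
                /* memblock_free(__pa(buf), size); */

                /* new interface: free the virtual pointer you were handed */
                memblock_free_ptr(buf, size);
        }
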
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Fixes: 40caa127 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 09 September 2021, 4 commits
  10. 13 August 2021, 1 commit
  11. 03 August 2021, 1 commit
  12. 09 July 2021, 1 commit
  13. 02 July 2021, 1 commit
    • init: print out unknown kernel parameters · 86d1919a
      Authored by Andrew Halaney
      It is easy to foobar setting a kernel parameter on the command line
      without realizing it; by default there's not much output that you can use
      to assess what the kernel did with that parameter.
      
      Make it a little more explicit which parameters on the command line
      _looked_ like a valid parameter for the kernel, but did not match anything
      and ultimately got tossed to init.  This is very similar to the unknown
      parameter message received when loading a module.
      
      This assumes the parameters are processed in a normal fashion; some
      parameters (dyndbg= for example) don't register their parameter with the
      rest of the kernel's parameters, and therefore always show up in this list
      (and are also given to init - like the rest of this list).
      
      Another example is that BOOT_IMAGE= is highlighted as an offender, which
      it technically is, but it is passed by LILO and GRUB, so most systems
      will see that complaint.
      
      An example output where "foobared" and "unrecognized" are intentionally
      invalid parameters:
      
        Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo
        Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo
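
      One plausible shape of the reporting helper, shown only as a sketch (the
      real patch builds the message differently in init/main.c; the signature
      and names below are hypothetical):

        static void __init print_unknown_bootoptions_sketch(char *const *unknown,
                                                            int count)
        {
                int i;

                if (!count)
                        return;

                /* one line, mirroring the example output above */
                pr_notice("Unknown command line parameters:");
                for (i = 0; i < count; i++)
                        pr_cont(" %s", unknown[i]);
                pr_cont("\n");
        }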
      
      Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com
      Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Suggested-by: Borislav Petkov <bp@suse.de>
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 11 June 2021, 1 commit
  15. 05 June 2021, 1 commit
    • pid: take a reference when initializing `cad_pid` · 0711f0d7
      Authored by Mark Rutland
      During boot, kernel_init_freeable() initializes `cad_pid` to the init
      task's struct pid.  Later on, we may change `cad_pid` via a sysctl, and
      when this happens proc_do_cad_pid() will increment the refcount on the
      new pid via get_pid(), and will decrement the refcount on the old pid
      via put_pid().  As we never called get_pid() when we initialized
      `cad_pid`, we decrement a reference we never incremented and can therefore
      free the init task's struct pid early.  As there can be dangling
      references to the struct pid, we can later encounter a use-after-free
      (e.g.  when delivering signals).
      
      This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to
      have been around since the conversion of `cad_pid` to struct pid in
      commit 9ec52099 ("[PATCH] replace cad_pid by a struct pid") from the
      pre-KASAN stone age of v2.6.19.
      
      Fix this by getting a reference to the init task's struct pid when we
      assign it to `cad_pid`.
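
      The fix is presumably a one-liner in kernel_init_freeable(); a sketch of
      the before/after, with the exact call site inferred from the description
      above:

        /* before: the initial assignment takes no reference */
        cad_pid = task_pid(current);

        /* after: take a reference so the put_pid() in proc_do_cad_pid() is balanced */
        cad_pid = get_pid(task_pid(current));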
      
      Full KASAN splat below.
      
         ==================================================================
         BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline]
         BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
         Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273
      
         CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1
         Hardware name: linux,dummy-virt (DT)
         Call trace:
          ns_of_pid include/linux/pid.h:153 [inline]
          task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
          do_notify_parent+0x308/0xe60 kernel/signal.c:1950
          exit_notify kernel/exit.c:682 [inline]
          do_exit+0x2334/0x2bd0 kernel/exit.c:845
          do_group_exit+0x108/0x2c8 kernel/exit.c:922
          get_signal+0x4e4/0x2a88 kernel/signal.c:2781
          do_signal arch/arm64/kernel/signal.c:882 [inline]
          do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936
          work_pending+0xc/0x2dc
      
         Allocated by task 0:
          slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516
          slab_alloc_node mm/slub.c:2907 [inline]
          slab_alloc mm/slub.c:2915 [inline]
          kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920
          alloc_pid+0xdc/0xc00 kernel/pid.c:180
          copy_process+0x2794/0x5e18 kernel/fork.c:2129
          kernel_clone+0x194/0x13c8 kernel/fork.c:2500
          kernel_thread+0xd4/0x110 kernel/fork.c:2552
          rest_init+0x44/0x4a0 init/main.c:687
          arch_call_rest_init+0x1c/0x28
          start_kernel+0x520/0x554 init/main.c:1064
          0x0
      
         Freed by task 270:
          slab_free_hook mm/slub.c:1562 [inline]
          slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600
          slab_free mm/slub.c:3161 [inline]
          kmem_cache_free+0x224/0x8e0 mm/slub.c:3177
          put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114
          put_pid+0x30/0x48 kernel/pid.c:109
          proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401
          proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591
          proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617
          call_write_iter include/linux/fs.h:1977 [inline]
          new_sync_write+0x3ac/0x510 fs/read_write.c:518
          vfs_write fs/read_write.c:605 [inline]
          vfs_write+0x9c4/0x1018 fs/read_write.c:585
          ksys_write+0x124/0x240 fs/read_write.c:658
          __do_sys_write fs/read_write.c:670 [inline]
          __se_sys_write fs/read_write.c:667 [inline]
          __arm64_sys_write+0x78/0xb0 fs/read_write.c:667
          __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
          invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
          el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129
          do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168
          el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416
          el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432
          el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701
      
         The buggy address belongs to the object at ffff23794dda0000
          which belongs to the cache pid of size 224
         The buggy address is located 4 bytes inside of
          224-byte region [ffff23794dda0000, ffff23794dda00e0)
         The buggy address belongs to the page:
         page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0
         head:(____ptrval____) order:1 compound_mapcount:0
         flags: 0x3fffc0000010200(slab|head)
         raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080
         raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
         page dumped because: kasan: bad access detected
      
         Memory state around the buggy address:
          ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
          ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
         >ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
          ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
          ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
         ==================================================================
      
      Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com
      Fixes: 9ec52099 ("[PATCH] replace cad_pid by a struct pid")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 01 June 2021, 1 commit
  17. 12 May 2021, 1 commit
    • sched/core: Initialize the idle task with preemption disabled · f1a0a376
      Authored by Valentin Schneider
      As pointed out by commit
      
        de9b8f5d ("sched: Fix crash trying to dequeue/enqueue the idle thread")
      
      init_idle() can and will be invoked more than once on the same idle
      task. At boot time, it is invoked for the boot CPU thread by
      sched_init(). Then smp_init() creates the threads for all the secondary
      CPUs and invokes init_idle() on them.
      
      As the hotplug machinery brings the secondaries to life, it will issue
      calls to idle_thread_get(), which itself invokes init_idle() yet again.
      In this case it's invoked twice more per secondary: at _cpu_up(), and at
      bringup_cpu().
      
      Given smp_init() already initializes the idle tasks for all *possible*
      CPUs, no further initialization should be required. Now, removing
      init_idle() from idle_thread_get() exposes some interesting expectations
      with regards to the idle task's preempt_count: the secondary startup always
      issues a preempt_disable(), requiring some reset of the preempt count to 0
      between hot-unplug and hotplug, which is currently served by
      idle_thread_get() -> idle_init().
      
      Given the idle task is supposed to have preemption disabled once and never
      see it re-enabled, it seems that what we actually want is to initialize its
      preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
      init_idle() from idle_thread_get().
      
      Secondary startups were patched via coccinelle:
      
        @begone@
        @@
      
        -preempt_disable();
        ...
        cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
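
      For readers who don't read coccinelle, the semantic patch above boils
      down to this change at each secondary-CPU startup site (illustrative
      only, no specific file implied):

        /* before: each secondary startup disabled preemption by hand */
        preempt_disable();
        cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

        /* after: the idle task already runs with preemption disabled */
        cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
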
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
  18. 11 May 2021, 1 commit
    • srcu: Initialize SRCU after timers · 8e9c01c7
      Authored by Frederic Weisbecker
      Once srcu_init() is called, the SRCU core will make use of delayed
      workqueues, which rely on timers.  However init_timers() is called
      several steps after rcu_init().  This means that a call_srcu() after
      rcu_init() but before init_timers() would find itself within a dangerously
      uninitialized timer core.
      
      This commit therefore creates a separate call to srcu_init() after
      init_timers() completes, which ensures that we stay in early SRCU mode
      until timers are safe(r).
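
      A sketch of the intended boot ordering (the function names are real, but
      the surrounding start_kernel() code is elided and the placement shown is
      only approximate):

        /* in start_kernel(), heavily abridged */
        rcu_init();      /* SRCU must stay in early mode: no workqueues yet */
        /* ... */
        init_timers();   /* timer core becomes usable */
        /* ... */
        srcu_init();     /* added call: SRCU may now queue delayed work */
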
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  19. 07 May 2021, 1 commit
    • init/initramfs.c: do unpacking asynchronously · e7cb072e
      Authored by Rasmus Villemoes
      Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3.
      
      These two patches are independent, but better-together.
      
      The second is a rather trivial patch that simply allows the developer to
      change "/sbin/modprobe" to something else - e.g.  the empty string, so
      that all request_module() during early boot return -ENOENT early, without
      even spawning a usermode helper, needlessly synchronizing with the
      initramfs unpacking.
      
      The first patch delegates decompressing the initramfs to a worker thread,
      allowing do_initcalls() in main.c to proceed to the device_ and late_
      initcalls without waiting for that decompression (and populating of
      rootfs) to finish.  Obviously, some of those later calls may rely on the
      initramfs being available, so I've added synchronization points in the
      firmware loader and usermodehelper paths - there might be other places
      that would need this, but so far no one has been able to think of any
      places I have missed.
      
      There's not much to win if most of the functionality needed during boot is
      only available as modules.  But systems with a custom-made .config and
      initramfs can boot faster, partly due to utilizing more than one cpu
      earlier, partly by avoiding known-futile modprobe calls (which would still
      trigger synchronization with the initramfs unpacking, thus eliminating
      most of the first benefit).
      
      This patch (of 2):
      
      Most of the boot process doesn't actually need anything from the
      initramfs, until of course PID1 is to be executed.  So instead of doing
      the decompressing and populating of the initramfs synchronously in
      populate_rootfs() itself, push that off to a worker thread.
      
      This is primarily motivated by an embedded ppc target, where unpacking
      even the rather modest sized initramfs takes 0.6 seconds, which is long
      enough that the external watchdog becomes unhappy that it doesn't get
      attention soon enough.  By doing the initramfs decompression in a worker
      thread, we get to do the device_initcalls and hence start petting the
      watchdog much sooner.
      
      Normal desktops might benefit as well.  On my mostly stock Ubuntu kernel,
      my initramfs is a 26M xz-compressed blob, decompressing to around 126M.
      That takes almost two seconds:
      
      [    0.201454] Trying to unpack rootfs image as initramfs...
      [    1.976633] Freeing initrd memory: 29416K
      
      Before this patch, these lines occur consecutively in dmesg.  With this
      patch, the timestamps on these two lines are roughly the same as above,
      but with 172 lines in between - so more than one cpu has been kept busy
      doing
      work that would otherwise only happen after the populate_rootfs()
      finished.
      
      Should one of the initcalls done after rootfs_initcall time (i.e., device_
      and late_ initcalls) need something from the initramfs (say, a kernel
      module or a firmware blob), it will simply wait for the initramfs
      unpacking to be done before proceeding, which should in theory make this
      completely safe.
      
      But if some driver pokes around in the filesystem directly and not via one
      of the official kernel interfaces (i.e.  request_firmware*(),
      call_usermodehelper*) that theory may not hold - also, I certainly might
      have missed a spot when sprinkling wait_for_initramfs().  So there is an
      escape hatch in the form of an initramfs_async= command line parameter.
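
      A minimal sketch of the pattern, using the generic async machinery; the
      real code in init/initramfs.c differs in detail (it uses its own async
      domain and naming), so treat this as illustrative only:

        #include <linux/async.h>
        #include <linux/init.h>

        static async_cookie_t initramfs_cookie;

        static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
        {
                /* unpack the built-in/external initramfs into rootfs here */
        }

        static int __init populate_rootfs(void)
        {
                /* kick off unpacking on a worker and return immediately */
                initramfs_cookie = async_schedule(do_populate_rootfs, NULL);
                return 0;
        }
        rootfs_initcall(populate_rootfs);

        /* firmware loader / usermodehelper call this before touching rootfs */
        void wait_for_initramfs(void)
        {
                async_synchronize_cookie(initramfs_cookie + 1);
        }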
      
      Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk
      Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 01 May 2021, 2 commits
  21. 08 April 2021, 1 commit
    • stack: Optionally randomize kernel stack offset each syscall · 39218ff4
      Authored by Kees Cook
      This provides the ability for architectures to enable kernel stack base
      address offset randomization. This feature is controlled by the boot
      param "randomize_kstack_offset=on/off", with its default value set by
      CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
      
      This feature is based on the original idea from the last public release
      of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
      All the credit for the original idea goes to the PaX team. Note that
      the design and implementation of this upstream randomize_kstack_offset
      feature differs greatly from the RANDKSTACK feature (see below).
      
      Reasoning for the feature:
      
      This feature aims to make harder the various stack-based attacks that
      rely on deterministic stack structure. We have had many such attacks in
      the past (just to name a few):
      
      https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
      https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
      https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
      
      As Linux kernel stack protections have been constantly improving
      (vmap-based stack allocation with guard pages, removal of thread_info,
      STACKLEAK), attackers have had to find new ways for their exploits
      to work. They have done so, continuing to rely on the kernel's stack
      determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
      were not relevant. For example, the following recent attacks would have
      been hampered if the stack offset was non-deterministic between syscalls:
      
      https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
      (page 70: targeting the pt_regs copy with linear stack overflow)
      
      https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
      (leaked stack address from one syscall as a target during next syscall)
      
      The main idea is that since the stack offset is randomized on each system
      call, it is harder for an attack to reliably land in any particular place
      on the thread stack, even with address exposures, as the stack base will
      change on the next syscall. Also, since randomization is performed after
      placing pt_regs, the ptrace-based approach[1] to discover the randomized
      offset during a long-running syscall should not be possible.
      
      Design description:
      
      During most of the kernel's execution, it runs on the "thread stack",
      which is pretty deterministic in its structure: it is fixed in size,
      and on every entry from userspace to kernel on a syscall the thread
      stack starts construction from an address fetched from the per-cpu
      cpu_current_top_of_stack variable. The first element to be pushed to the
      thread stack is the pt_regs struct that stores all required CPU registers
      and syscall parameters. Finally the specific syscall function is called,
      with the stack being used as the kernel executes the resulting request.
      
      The goal of randomize_kstack_offset feature is to add a random offset
      after the pt_regs has been pushed to the stack and before the rest of the
      thread stack is used during the syscall processing, and to change it every
      time a process issues a syscall. The source of randomness is currently
      architecture-defined (but x86 is using the low byte of rdtsc()). Future
      improvements for different entropy sources are possible, but out of scope
      for this patch. Furthermore, to add more unpredictability, new offsets
      are chosen at the end of syscalls (the timing of which should be less
      easy to measure from userspace than at syscall entry time), and stored
      in a per-CPU variable, so that the life of the value does not stay
      explicitly tied to a single task.
      
      As suggested by Andy Lutomirski, the offset is added using alloca()
      and an empty asm() statement with an output constraint, since it avoids
      changes to assembly syscall entry code, to the unwinder, and provides
      correct stack alignment as defined by the compiler.
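
      A simplified sketch of the two halves of the mechanism; the real macros
      in the kernel's randomize_kstack header differ in naming, masking and
      static-key handling, so the _sketch names below are deliberately
      hypothetical:

        #include <linux/percpu.h>

        DEFINE_PER_CPU(u32, kstack_offset_sketch);

        /* on syscall entry, after pt_regs is pushed: consume the stored offset */
        #define add_random_kstack_offset_sketch() do {                          \
                u32 offset = raw_cpu_read(kstack_offset_sketch);                \
                u8 *ptr = __builtin_alloca(offset & 0x3ff);                     \
                /* empty asm with an output constraint keeps the alloca alive */\
                asm volatile("" : "=m"(*ptr));                                  \
        } while (0)

        /* at syscall exit: fold in fresh entropy for the next syscall */
        #define choose_random_kstack_offset_sketch(rand) do {                   \
                u32 old = raw_cpu_read(kstack_offset_sketch);                   \
                raw_cpu_write(kstack_offset_sketch, old ^ (rand));              \
        } while (0)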
      
      In order to make this available by default with zero performance impact
      for those that don't want it, it is boot-time selectable with static
      branches. This way, if the overhead is not wanted, it can just be
      left turned off with no performance impact.
      
      The generated assembly for x86_64 with GCC looks like this:
      
      ...
      ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
      					    # 12380 <kstack_offset>
      ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
      ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
      ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
      ffffffff8100398c: 48 29 c4              sub %rax,%rsp
      ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
      ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax
      ...
      
      As a result of the above stack alignment, this patch introduces about
      5 bits of randomness after pt_regs is spilled to the thread stack on
      x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for
      stack alignment). The amount of entropy could be adjusted based on how
      much of the stack space we wish to trade for security.
      
      My measure of syscall performance overhead (on x86_64):
      
      lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
          randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
          randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds
      
      So, roughly 0.9% overhead growth for a no-op syscall, which is very
      manageable. And for people that don't want this, it's off by default.
      
      There are two gotchas with using the alloca() trick. First,
      compilers that have Stack Clash protection (-fstack-clash-protection)
      enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
      any dynamic stack allocations. While the randomization offset is
      always less than a page, the resulting assembly would still contain
      (unreachable!) probing routines, bloating the resulting assembly. To
      avoid this, -fno-stack-clash-protection is unconditionally added to
      the kernel Makefile since this is the only dynamic stack allocation in
      the kernel (now that VLAs have been removed) and it is provably safe
      from Stack Clash style attacks.
      
      The second gotcha with alloca() is a negative interaction with
      -fstack-protector*, in that it sees the alloca() as an array allocation,
      which triggers the unconditional addition of the stack canary function
      pre/post-amble which slows down syscalls regardless of the static
      branch. In order to avoid adding this unneeded check and its associated
      performance impact, architectures need to carefully remove uses of
      -fstack-protector-strong (or -fstack-protector) in the compilation units
      that use the add_random_kstack() macro and to audit the resulting stack
      mitigation coverage (to make sure no desired coverage disappears). No
      change is visible for this on x86 because the stack protector is already
      unconditionally disabled for the compilation unit, but the change is
      required on arm64. There is, unfortunately, no attribute that can be
      used to disable stack protector for specific functions.
      
      Comparison to PaX RANDKSTACK feature:
      
      The RANDKSTACK feature randomizes the location of the stack start
      (cpu_current_top_of_stack), i.e. including the location of pt_regs
      structure itself on the stack. Initially this patch followed the same
      approach, but during the recent discussions[2], it has been determined
      to be of a little value since, if ptrace functionality is available for
      an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
      different offsets in the pt_regs struct, observe the cache behavior of
      the pt_regs accesses, and figure out the random stack offset. Another
      difference is that the random offset is stored in a per-cpu variable,
      rather than having it be per-thread. As a result, these implementations
      differ a fair bit in their implementation details and results, though
      obviously the intent is similar.
      
      [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
      [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
      [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
      Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
  22. 19 March 2021, 1 commit
  23. 27 February 2021, 3 commits
    • kgdb: fix to kill breakpoints on initmem after boot · d54ce615
      Authored by Sumit Garg
      Currently, breakpoints in the kernel .init.text section are not handled
      correctly: we still allow them to be removed even after the corresponding
      pages have been freed.
      
      Fix it via killing .init.text section breakpoints just prior to initmem
      pages being freed.
      
      Doug: "HW breakpoints aren't handled by this patch but it's probably
      not such a big deal".
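
      A sketch of the shape of the fix: a helper in the kgdb core walks the
      breakpoint table and drops anything that points into init memory, and
      free_initmem() calls it right before the pages are released (simplified;
      the real code also takes the debugger locks):

        void kgdb_free_init_mem(void)
        {
                int i;

                /* the real code holds the kgdb breakpoint lock here */
                for (i = 0; i < KGDB_MAX_BREAKPOINTS; i++) {
                        if (kgdb_break[i].state != BP_UNDEFINED &&
                            init_section_contains((void *)kgdb_break[i].bpt_addr, 0))
                                kgdb_break[i].state = BP_UNDEFINED;
                }
        }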
      
      Link: https://lkml.kernel.org/r/20210224081652.587785-1-sumit.garg@linaro.org
      Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
      Suggested-by: Doug Anderson <dianders@chromium.org>
      Acked-by: Doug Anderson <dianders@chromium.org>
      Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
      Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib: stackdepot: add support to disable stack depot · e1fdc403
      Authored by Vijayanand Jitta
      Add a kernel parameter, stack_depot_disable, to disable stack depot, so
      that the stack hash table doesn't consume any memory when stack depot is
      disabled.

      The use case is CONFIG_PAGE_OWNER without page_owner=on.  Without this
      patch, stackdepot will consume the memory for the hashtable.  By default
      it's 8M, which is far from trivial.

      With this option, on a CONFIG_PAGE_OWNER configured system with
      page_owner=off and stack_depot_disable on the kernel command line, we can
      save the memory otherwise wasted on the hashtable.
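
      The knob is a plain early_param; a minimal sketch of the handler,
      assuming a module-level stack_depot_disable bool in lib/stackdepot.c
      (details may differ from the actual patch):

        static bool stack_depot_disable;

        static int __init is_stack_depot_disabled(char *str)
        {
                int ret = kstrtobool(str, &stack_depot_disable);

                if (!ret && stack_depot_disable)
                        pr_info("Stack Depot is disabled\n");
                return 0;
        }
        early_param("stack_depot_disable", is_stack_depot_disabled);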
      
      [akpm@linux-foundation.org: fix CONFIG_STACKDEPOT=n build]
      
      Link: https://lkml.kernel.org/r/1611749198-24316-2-git-send-email-vjitta@codeaurora.org
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yogesh Lal <ylal@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add Kernel Electric-Fence infrastructure · 0ce20dd8
      Authored by Alexander Potapenko
      Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7.
      
      This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
      low-overhead sampling-based memory safety error detector of heap
      use-after-free, invalid-free, and out-of-bounds access errors.  This
      series enables KFENCE for the x86 and arm64 architectures, and adds
      KFENCE hooks to the SLAB and SLUB allocators.
      
      KFENCE is designed to be enabled in production kernels, and has near
      zero performance overhead. Compared to KASAN, KFENCE trades performance
      for precision. The main motivation behind KFENCE's design is that with
      enough total uptime KFENCE will detect bugs in code paths not typically
      exercised by non-production test workloads. One way to quickly achieve a
      large enough total uptime is when the tool is deployed across a large
      fleet of machines.
      
      KFENCE objects each reside on a dedicated page, at either the left or
      right page boundaries. The pages to the left and right of the object
      page are "guard pages", whose attributes are changed to a protected
      state, and cause page faults on any attempted access to them. Such page
      faults are then intercepted by KFENCE, which handles the fault
      gracefully by reporting a memory access error.
      
      Guarded allocations are set up based on a sample interval (can be set
      via kfence.sample_interval). After expiration of the sample interval,
      the next allocation through the main allocator (SLAB or SLUB) returns a
      guarded allocation from the KFENCE object pool. At this point, the timer
      is reset, and the next allocation is set up after the expiration of the
      interval.
      
      To enable/disable a KFENCE allocation through the main allocator's
      fast-path without overhead, KFENCE relies on static branches via the
      static keys infrastructure. The static branch is toggled to redirect the
      allocation to KFENCE.
      
      The KFENCE memory pool is of fixed size, and if the pool is exhausted no
      further KFENCE allocations occur. The default config is conservative
      with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB
      pages).
      
      We have verified by running synthetic benchmarks (sysbench I/O,
      hackbench) and production server-workload benchmarks that a kernel with
      KFENCE (using sample intervals 100-500ms) is performance-neutral
      compared to a non-KFENCE baseline kernel.
      
      KFENCE is inspired by GWP-ASan [1], a userspace tool with similar
      properties. The name "KFENCE" is a homage to the Electric Fence Malloc
      Debugger [2].
      
      For more details, see Documentation/dev-tools/kfence.rst added in the
      series -- also viewable here:
      
      	https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst
      
      [1] http://llvm.org/docs/GwpAsan.html
      [2] https://linux.die.net/man/3/efence
      
      This patch (of 9):
      
      This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
      low-overhead sampling-based memory safety error detector of heap
      use-after-free, invalid-free, and out-of-bounds access errors.
      
      KFENCE is designed to be enabled in production kernels, and has near
      zero performance overhead. Compared to KASAN, KFENCE trades performance
      for precision. The main motivation behind KFENCE's design is that with
      enough total uptime KFENCE will detect bugs in code paths not typically
      exercised by non-production test workloads. One way to quickly achieve a
      large enough total uptime is when the tool is deployed across a large
      fleet of machines.
      
      KFENCE objects each reside on a dedicated page, at either the left or
      right page boundaries. The pages to the left and right of the object
      page are "guard pages", whose attributes are changed to a protected
      state, and cause page faults on any attempted access to them. Such page
      faults are then intercepted by KFENCE, which handles the fault
      gracefully by reporting a memory access error. To detect out-of-bounds
      writes to memory within the object's page itself, KFENCE also uses
      pattern-based redzones. The following figure illustrates the page
      layout:
      
        ---+-----------+-----------+-----------+-----------+-----------+---
           | xxxxxxxxx | O :       | xxxxxxxxx |       : O | xxxxxxxxx |
           | xxxxxxxxx | B :       | xxxxxxxxx |       : B | xxxxxxxxx |
           | x GUARD x | J : RED-  | x GUARD x | RED-  : J | x GUARD x |
           | xxxxxxxxx | E :  ZONE | xxxxxxxxx |  ZONE : E | xxxxxxxxx |
           | xxxxxxxxx | C :       | xxxxxxxxx |       : C | xxxxxxxxx |
           | xxxxxxxxx | T :       | xxxxxxxxx |       : T | xxxxxxxxx |
        ---+-----------+-----------+-----------+-----------+-----------+---
      
      Guarded allocations are set up based on a sample interval (can be set
      via kfence.sample_interval). After expiration of the sample interval, a
      guarded allocation from the KFENCE object pool is returned to the main
      allocator (SLAB or SLUB). At this point, the timer is reset, and the
      next allocation is set up after the expiration of the interval.
      
      To enable/disable a KFENCE allocation through the main allocator's
      fast-path without overhead, KFENCE relies on static branches via the
      static keys infrastructure. The static branch is toggled to redirect the
      allocation to KFENCE. To date, we have verified by running synthetic
      benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE
      is performance-neutral compared to the non-KFENCE baseline.
      
      For more details, see Documentation/dev-tools/kfence.rst (added later in
      the series).
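
      A sketch of how the static-key gating keeps the allocator fast path
      cheap; the _sketch wrapper below is illustrative, and the key and helper
      names are taken from the series as described, so treat details as
      approximate:

        static __always_inline void *kfence_alloc_sketch(struct kmem_cache *s,
                                                         size_t size, gfp_t flags)
        {
                /* common case: static branch is off, fall back to SLAB/SLUB */
                if (!static_branch_unlikely(&kfence_allocation_key))
                        return NULL;
                /* sample interval expired: hand out a guarded KFENCE object */
                return __kfence_alloc(s, size, flags);
        }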
      
      [elver@google.com: fix parameter description for kfence_object_start()]
        Link: https://lkml.kernel.org/r/20201106092149.GA2851373@elver.google.com
      [elver@google.com: avoid stalling work queue task without allocations]
        Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com
        Link: https://lkml.kernel.org/r/20201110135320.3309507-1-elver@google.com
      [elver@google.com: fix potential deadlock due to wake_up()]
        Link: https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com
        Link: https://lkml.kernel.org/r/20210104130749.1768991-1-elver@google.com
      [elver@google.com: add option to use KFENCE without static keys]
        Link: https://lkml.kernel.org/r/20210111091544.3287013-1-elver@google.com
      [elver@google.com: add missing copyright and description headers]
        Link: https://lkml.kernel.org/r/20210118092159.145934-1-elver@google.com
      
      Link: https://lkml.kernel.org/r/20201103175841.3495947-2-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: SeongJae Park <sjpark@amazon.de>
      Co-developed-by: Marco Elver <elver@google.com>
      Reviewed-by: Jann Horn <jannh@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 16 February 2021, 1 commit
  25. 06 February 2021, 1 commit
  26. 09 January 2021, 1 commit
  27. 16 December 2020, 3 commits
    • mm, page_alloc: do not rely on the order of page_poison and init_on_alloc/free parameters · 04013513
      Authored by Vlastimil Babka
      Patch series "cleanup page poisoning", v3.
      
      I have identified a number of issues and opportunities for cleanup with
      CONFIG_PAGE_POISON and friends:
      
       - interaction with init_on_alloc and init_on_free parameters depends on
         the order of parameters (Patch 1)
      
       - the boot time enabling uses a static key, but inefficiently (Patch 2)
      
       - sanity checking is incompatible with hibernation (Patch 3)
      
       - CONFIG_PAGE_POISONING_NO_SANITY can be removed now that we have
         init_on_free (Patch 4)
      
       - CONFIG_PAGE_POISONING_ZERO can be most likely removed now that we
         have init_on_free (Patch 5)
      
      This patch (of 5):
      
      Enabling page_poison=1 together with init_on_alloc=1 or init_on_free=1
      produces a warning in dmesg that page_poison takes precedence.  However,
      as these warnings are printed in early_param handlers for
      init_on_alloc/free, they are not printed if page_poison is enabled later
      on the command line (handlers are called in the order of their
      parameters), or when init_on_alloc/free is always enabled by the
      respective config option - before the page_poison early param handler is
      called, it is not considered to be enabled.  This is inconsistent.
      
      We can remove the dependency on order by making the init_on_* parameters
      only set a boolean variable, and postponing the evaluation after all early
      params have been processed.  Introduce a new
      init_mem_debugging_and_hardening() function for that, and move the related
      debug_pagealloc processing there as well.
      
      As a result init_mem_debugging_and_hardening() knows always accurately if
      init_on_* and/or page_poison options were enabled.  Thus we can also
      optimize want_init_on_alloc() and want_init_on_free().  We don't need to
      check page_poisoning_enabled() there, we can instead not enable the
      init_on_* static keys at all, if page poisoning is enabled.  This results
      in simpler and more effective code.
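
      A sketch of the decoupling, condensed from the shape of the patch (only
      the init_on_alloc half is shown and names are abbreviated; treat it as an
      approximation rather than the exact diff):

        static bool _init_on_alloc_enabled_early __initdata =
                        IS_ENABLED(CONFIG_INIT_ON_ALLOC_DEFAULT_ON);

        static int __init early_init_on_alloc(char *buf)
        {
                /* only record the request; decide later */
                return kstrtobool(buf, &_init_on_alloc_enabled_early);
        }
        early_param("init_on_alloc", early_init_on_alloc);

        void __init init_mem_debugging_and_hardening(void)
        {
                /* runs once, after all early params have been parsed */
                if (_init_on_alloc_enabled_early) {
                        if (page_poisoning_enabled())
                                pr_info("mem auto-init: page_poison takes precedence over init_on_alloc\n");
                        else
                                static_branch_enable(&init_on_alloc);
                }
                /* init_on_free and debug_pagealloc are handled the same way */
        }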
      
      Link: https://lkml.kernel.org/r/20201113104033.22907-1-vbabka@suse.cz
      Link: https://lkml.kernel.org/r/20201113104033.22907-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mateusz Nosek <mateusznosek0@gmail.com>
      Cc: Laura Abbott <labbott@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • init/main: fix broken buffer_init when DEFERRED_STRUCT_PAGE_INIT set · ba8f3587
      Authored by Lin Feng
      In the booting phase, if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set, we
      have the following callchain:
      
      start_kernel
      ...
        mm_init
          mem_init
           memblock_free_all
             reset_all_zones_managed_pages
             free_low_memory_core_early
      ...
        buffer_init
          nr_free_buffer_pages
            zone->managed_pages
      ...
        rest_init
          kernel_init
            kernel_init_freeable
              page_alloc_init_late
                kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
                wait_for_completion(&pgdat_init_all_done_comp);
                ...
                files_maxfiles_init
      
      It's clear that buffer_init depends on zone->managed_pages, but that
      counter is reset in reset_all_zones_managed_pages and the pages are only
      re-added to zone->managed_pages afterwards; when buffer_init runs, this
      process is only half done, and most pages are not added back until
      deferred_init_memmap has finished.  On large-memory machines the counting
      of nr_free_buffer_pages therefore drifts too much, and also drifts from
      kernel to kernel on the same hardware.

      The fix is simple: delay the buffer_init run until deferred_init_memmap
      is all done (see the sketch below).
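
      A sketch of the new placement, assuming the call lands in
      kernel_init_freeable() right after page_alloc_init_late() (the call names
      are real, everything around them is elided and the placement shown is
      approximate):

        static noinline void __init kernel_init_freeable(void)
        {
                /* ... */
                page_alloc_init_late();   /* waits for deferred_init_memmap() */
                buffer_init();            /* moved here: managed_pages is now final */
                /* ... */
                files_maxfiles_init();
                /* ... */
        }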
      
      But as corrected by this patch, max_buffer_heads becomes very large; the
      value is roughly 4 times totalram_pages, per the formula:

        max_buffer_heads = nrpages * (10%) * (PAGE_SIZE / sizeof(struct buffer_head));

      Say in a 64GB memory box we have 16777216 pages; then max_buffer_heads
      turns out to be roughly 67,108,864.  In common cases a buffer_head is
      mapped to one page/block (4KB), so the number of buffer_heads in use
      never exceeds totalram_pages.  IMO this is likely to make the
      buffer_heads_over_limit bool value always false, and to make the
      'if (buffer_heads_over_limit)' test in vmscan unnecessary.
      
      So this patch will change the original behavior related to
      buffer_heads_over_limit in vmscan, since we previously used a half-done
      value of zone->managed_pages; alternatively, we could use a smaller
      factor (<10%) in the previous formula.
      
      akpm: I think this is OK - the max_buffer_heads code is only needed on
      highmem machines, to prevent ZONE_NORMAL from being consumed by large
      amounts of buffer_heads attached to highmem pagecache.  This problem will
      not occur on 64-bit machines, so this feature's non-functionality on such
      machines is a feature, not a bug.
      
      Link: https://lkml.kernel.org/r/20201123110500.103523-1-linf@wangsu.com
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix page_owner initializing issue for arm32 · 7fb7ab6d
      Authored by Zhenhua Huang
      Page owner information for the pages used by page owner itself is missing
      on arm32 targets.  The reason is that dummy_handle and failure_handle are
      not initialized correctly.  The buddy allocator is used to initialize
      these two handles; however, the buddy allocator is not ready when page
      owner calls it.  This change fixes that by initializing page owner after
      buddy initialization.
      
      The working flow before and after this change are:
      original logic:
       1. allocated memory for page_ext(using memblock).
       2. invoke the init callback of page_ext_ops like page_owner(using buddy
          allocator).
       3. initialize buddy.
      
      after this change:
       1. allocated memory for page_ext(using memblock).
       2. initialize buddy.
       3. invoke the init callback of page_ext_ops like page_owner(using buddy
          allocator).
      
      With the change, failure_handle/dummy_handle get their correct values,
      and the page owner output, for example, now includes the entry for page
      owner itself:
      
        Page allocated via order 2, mask 0x6202c0(GFP_USER|__GFP_NOWARN), pid 1006, ts 67278156558 ns
        PFN 543776 type Unmovable Block 531 type Unmovable Flags 0x0()
          init_page_owner+0x28/0x2f8
          invoke_init_callbacks_flatmem+0x24/0x34
          start_kernel+0x33c/0x5d8
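
      A conceptual sketch of the reordering listed above (the _sketch function
      stands in for the real init/main.c code; page_ext_init_flatmem(),
      mem_init() and page_ext_init_flatmem_late() are the functions involved,
      but their exact call sites are simplified here):

        static void __init mm_init_sketch(void)
        {
                page_ext_init_flatmem();        /* step 1: memblock allocation only    */
                mem_init();                     /* step 2: buddy allocator is ready    */
                page_ext_init_flatmem_late();   /* step 3: run page_ext_ops->init()    */
                                                /*         callbacks, incl. page_owner */
        }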
      
      Link: https://lkml.kernel.org/r/1603104925-5888-1-git-send-email-zhenhuah@codeaurora.org
      Signed-off-by: Zhenhua Huang <zhenhuah@codeaurora.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>