1. 29 January 2021, 1 commit
  2. 09 January 2021, 1 commit
  3. 23 December 2020, 1 commit
  4. 16 December 2020, 3 commits
    • mm, page_alloc: do not rely on the order of page_poison and init_on_alloc/free parameters · 04013513
      Committed by Vlastimil Babka
      Patch series "cleanup page poisoning", v3.
      
      I have identified a number of issues and opportunities for cleanup with
      CONFIG_PAGE_POISONING and friends:
      
       - interaction with init_on_alloc and init_on_free parameters depends on
         the order of parameters (Patch 1)
      
       - the boot-time enabling uses a static key, but inefficiently (Patch 2)
      
       - sanity checking is incompatible with hibernation (Patch 3)
      
       - CONFIG_PAGE_POISONING_NO_SANITY can be removed now that we have
         init_on_free (Patch 4)
      
       - CONFIG_PAGE_POISONING_ZERO can be most likely removed now that we
         have init_on_free (Patch 5)
      
      This patch (of 5):
      
      Enabling page_poison=1 together with init_on_alloc=1 or init_on_free=1
      produces a warning in dmesg that page_poison takes precedence.  However,
      as these warnings are printed in early_param handlers for
      init_on_alloc/free, they are not printed if page_poison is enabled later
      on the command line (handlers are called in the order of their
      parameters), or when init_on_alloc/free is always enabled by the
      respective config option - before the page_poison early param handler is
      called, it is not considered to be enabled.  This is inconsistent.
      
      We can remove the dependency on order by making the init_on_* parameters
      only set a boolean variable, and postponing the evaluation until after all early
      params have been processed.  Introduce a new
      init_mem_debugging_and_hardening() function for that, and move the related
      debug_pagealloc processing there as well.
      
      As a result, init_mem_debugging_and_hardening() always knows accurately
      whether the init_on_* and/or page_poison options were enabled.  Thus we can
      also optimize want_init_on_alloc() and want_init_on_free().  We don't need
      to check page_poisoning_enabled() there; instead, we can simply not enable
      the init_on_* static keys at all when page poisoning is enabled.  This
      results in simpler and more efficient code.
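
      A minimal sketch of the postponed evaluation described above, assuming
      simplified names: only init_mem_debugging_and_hardening() is named by this
      patch, while the boolean flags and the message text here are illustrative.

        /* The early_param handlers now only record the request. */
        static bool want_init_on_alloc_early __initdata;
        static bool want_init_on_free_early __initdata;

        void __init init_mem_debugging_and_hardening(void)
        {
            /*
             * Runs once all early params have been parsed, so the outcome no
             * longer depends on the order of page_poison and init_on_* on the
             * command line.
             */
            if (page_poisoning_enabled()) {
                if (want_init_on_alloc_early || want_init_on_free_early)
                    pr_info("mem auto-init: page_poison takes precedence over init_on_alloc/init_on_free\n");
                /* Leave the init_on_* static keys disabled entirely. */
                return;
            }

            if (want_init_on_alloc_early)
                static_branch_enable(&init_on_alloc);
            if (want_init_on_free_early)
                static_branch_enable(&init_on_free);
        }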
      
      Link: https://lkml.kernel.org/r/20201113104033.22907-1-vbabka@suse.cz
      Link: https://lkml.kernel.org/r/20201113104033.22907-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mateusz Nosek <mateusznosek0@gmail.com>
      Cc: Laura Abbott <labbott@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04013513
    • init/main: fix broken buffer_init when DEFERRED_STRUCT_PAGE_INIT set · ba8f3587
      Committed by Lin Feng
      In the boot phase, if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set,
      we have the following call chain:
      
      start_kernel
      ...
        mm_init
          mem_init
           memblock_free_all
             reset_all_zones_managed_pages
             free_low_memory_core_early
      ...
        buffer_init
          nr_free_buffer_pages
            zone->managed_pages
      ...
        rest_init
          kernel_init
            kernel_init_freeable
              page_alloc_init_late
                kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
                wait_for_completion(&pgdat_init_all_done_comp);
                ...
                files_maxfiles_init
      
      It's clear that buffer_init depends on zone->managed_pages, but
      zone->managed_pages is reset in reset_all_zones_managed_pages and the pages
      are only re-added afterwards.  When buffer_init runs, this process is only
      half done, and most of the pages are not added back until
      deferred_init_memmap has finished.  On machines with large memory, the
      count in nr_free_buffer_pages therefore drifts too much, and it also drifts
      from kernel to kernel on the same hardware.

      The fix is simple: delay buffer_init until deferred_init_memmap is all
      done.
      
      But as corrected by this patch, max_buffer_heads becomes very large:
      roughly 4 times totalram_pages, per the formula:

        max_buffer_heads = nrpages * 10% * (PAGE_SIZE / sizeof(struct buffer_head));

      Say in a 64GB memory box we have 16777216 pages; max_buffer_heads then
      turns out to be roughly 67,108,864.  In common cases a buffer_head maps to
      one page/block (4KB), so the number of buffer_heads in use never exceeds
      totalram_pages.  That likely makes the buffer_heads_over_limit bool always
      false, and makes the 'if (buffer_heads_over_limit)' test in vmscan
      unnecessary.
      
      So this patch changes the original behavior related to
      buffer_heads_over_limit in vmscan, since we previously used a half-done
      value of zone->managed_pages; alternatively, a smaller factor (<10%) could
      be used in the formula above.
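
      For reference, a simplified sketch of the computation in buffer_init()
      that the numbers above come from (not the exact fs/buffer.c code; the
      worked figures assume 4KB pages and a buffer_head of roughly 100 bytes):

        void __init buffer_init(void)
        {
            unsigned long nrpages;

            /* Limit buffer_heads to roughly 10% of low memory. */
            nrpages = (nr_free_buffer_pages() * 10) / 100;
            max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
            /*
             * 64GB box: 16777216 pages -> ~1.6M pages at 10% -> with ~40
             * buffer_heads per 4KB page this is roughly 67 million heads,
             * about 4x totalram_pages once deferred page init is complete.
             */
        }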
      
      akpm: I think this is OK - the max_buffer_heads code is only needed on
      highmem machines, to prevent ZONE_NORMAL from being consumed by large
      amounts of buffer_heads attached to highmem pagecache.  This problem will
      not occur on 64-bit machines, so this feature's non-functionality on such
      machines is a feature, not a bug.
      
      Link: https://lkml.kernel.org/r/20201123110500.103523-1-linf@wangsu.com
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba8f3587
    • mm: fix page_owner initializing issue for arm32 · 7fb7ab6d
      Committed by Zhenhua Huang
      Page owner information for the pages used by page owner itself is missing
      on arm32 targets.  The reason is that dummy_handle and failure_handle are
      not initialized correctly: the buddy allocator is used to initialize these
      two handles, but it is not ready yet when page owner calls it.  This change
      fixes that by initializing page owner after buddy initialization.
      
      The working flows before and after this change are:
      original logic:
       1. allocate memory for page_ext (using memblock).
       2. invoke the init callback of page_ext_ops like page_owner (using the
          buddy allocator).
       3. initialize buddy.

      after this change:
       1. allocate memory for page_ext (using memblock).
       2. initialize buddy.
       3. invoke the init callback of page_ext_ops like page_owner (using the
          buddy allocator).
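
      A rough sketch of the reordered boot path; the function names are
      approximate rather than the exact ones used in the patch:

        static void __init mm_init_ordering_sketch(void)
        {
            page_ext_init_flatmem();        /* 1. reserve page_ext storage, memblock only */
            mem_init();                     /* 2. hand memory over to the buddy allocator */
            page_ext_init_flatmem_late();   /* 3. run the page_ext_ops init callbacks;
                                             *    page_owner can now allocate its
                                             *    dummy_handle/failure_handle from buddy */
        }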
      
      With the change, failure_handle/dummy_handle get their correct values, and
      the page owner output, for example, now includes the entry for page owner
      itself:
      
        Page allocated via order 2, mask 0x6202c0(GFP_USER|__GFP_NOWARN), pid 1006, ts 67278156558 ns
        PFN 543776 type Unmovable Block 531 type Unmovable Flags 0x0()
          init_page_owner+0x28/0x2f8
          invoke_init_callbacks_flatmem+0x24/0x34
          start_kernel+0x33c/0x5d8
      
      Link: https://lkml.kernel.org/r/1603104925-5888-1-git-send-email-zhenhuah@codeaurora.org
      Signed-off-by: Zhenhua Huang <zhenhuah@codeaurora.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7fb7ab6d
  5. 15 December 2020, 1 commit
  6. 12 December 2020, 1 commit
  7. 11 December 2020, 1 commit
    • exec: Transform exec_update_mutex into a rw_semaphore · f7cfd871
      Committed by Eric W. Biederman
      Recently syzbot reported[0] that there is a deadlock amongst the users
      of exec_update_mutex.  The problematic lock ordering found by lockdep
      was:
      
         perf_event_open  (exec_update_mutex -> ovl_i_mutex)
         chown            (ovl_i_mutex       -> sb_writes)
         sendfile         (sb_writes         -> p->lock)
           by reading from a proc file and writing to overlayfs
         proc_pid_syscall (p->lock           -> exec_update_mutex)
      
      While looking at possible solutions it occurred to me that all of the
      users and possible users involved only wanted the state of the given
      process to remain the same.  They are all readers.  The only writer is
      exec.
      
      There is no reason for readers to block on each other.  So fix
      this deadlock by transforming exec_update_mutex into a rw_semaphore
      named exec_update_lock that only exec takes for writing.
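
      A minimal sketch of the resulting locking pattern (field placement is
      abbreviated; the readers listed above all follow the same shape):

        /* Readers: perf_event_open(), proc_pid_syscall(), ... */
        static int read_task_exec_state(struct task_struct *task)
        {
            int err;

            err = down_read_killable(&task->signal->exec_update_lock);
            if (err)
                return err;
            /* ... inspect state that must not change across exec ... */
            up_read(&task->signal->exec_update_lock);
            return 0;
        }

        /* Writer: only exec takes the lock exclusively while it swaps out
         * the mm, creds, and so on. */
        static int writer_exec_path(struct task_struct *task)
        {
            if (down_write_killable(&task->signal->exec_update_lock))
                return -ERESTARTNOINTR;
            /* ... perform the exec-time updates ... */
            up_write(&task->signal->exec_update_lock);
            return 0;
        }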
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Fixes: eea96732 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
      [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
      Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
      Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      f7cfd871
  8. 02 December 2020, 6 commits
  9. 01 December 2020, 3 commits
  10. 21 November 2020, 1 commit
  11. 20 November 2020, 1 commit
  12. 13 November 2020, 1 commit
  13. 03 November 2020, 1 commit
  14. 10 October 2020, 1 commit
  15. 01 October 2020, 1 commit
    • io_uring: don't rely on weak ->files references · 0f212204
      Committed by Jens Axboe
      Grab actual references to the files_struct.  To avoid circular reference
      issues due to this, we add a per-task note that keeps track of which
      io_uring contexts a task has used.  When the task execs or exits its
      assigned files, we cancel requests based on this tracking.
      
      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.
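
      A conceptual sketch of the tracking only; these are not the actual
      io_uring data structures, and the names are hypothetical:

        /* Each task keeps a note per io_uring context it has submitted to... */
        struct io_ring_use_note {
            struct io_ring_ctx  *ctx;       /* ring this task has used */
            struct list_head    task_link;  /* chained off the task    */
        };

        /* ...so that exec/exit can walk the task's notes and cancel its
         * pending requests before the files_struct reference is dropped. */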
      
      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0f212204
  16. 25 September 2020, 1 commit
  17. 19 September 2020, 2 commits
  18. 08 September 2020, 1 commit
  19. 05 September 2020, 1 commit
  20. 01 September 2020, 1 commit
  21. 29 August 2020, 1 commit
    • bpf: Introduce sleepable BPF programs · 1e6c62a8
      Committed by Alexei Starovoitov
      Introduce sleepable BPF programs that can request this property for
      themselves via the BPF_F_SLEEPABLE flag at program load time.  In that case
      they will be able to use helpers like bpf_copy_from_user() that might
      sleep.  At present only fentry/fexit/fmod_ret and lsm programs can request
      to be sleepable, and only when they are attached to kernel functions that
      are known to allow sleeping.
      
      The non-sleepable programs rely on implicit rcu_read_lock() and
      migrate_disable() to protect the lifetime of programs, of the maps that
      they use, and of the per-cpu kernel structures used to pass info between
      bpf programs and the kernel.  The sleepable programs cannot be enclosed in
      rcu_read_lock().  migrate_disable() maps to preempt_disable() in non-RT
      kernels, so the progs should not be enclosed in migrate_disable() either.
      Therefore rcu_read_lock_trace is used to protect the lifetime of sleepable
      progs.
      
      There are many networking and tracing program types.  In many cases the
      'struct bpf_prog *' pointer itself is RCU-protected within some other
      kernel data structure, and the kernel code uses rcu_dereference() to load
      that program pointer and call BPF_PROG_RUN() on it.  None of these cases
      are touched.  Instead, sleepable bpf programs are allowed with the bpf
      trampoline only.  The program pointers are hard-coded into the generated
      assembly of the bpf trampoline, and synchronize_rcu_tasks_trace() is used
      to protect the lifetime of the program.  The same trampoline can hold both
      sleepable and non-sleepable progs.
      
      When rcu_read_lock_trace is held, it means that some sleepable bpf program
      is running from a bpf trampoline.  Those programs can use bpf arrays and
      preallocated hash/lru maps.  These map types wait for programs to complete
      via synchronize_rcu_tasks_trace().
      
      Updates to a trampoline now have to do synchronize_rcu_tasks_trace() and
      synchronize_rcu_tasks() to wait for sleepable progs to finish and for the
      trampoline assembly to finish.
      
      This is the first step of introducing sleepable progs. Eventually dynamically
      allocated hash maps can be allowed and networking program types can become
      sleepable too.
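
      A minimal sketch of the protection scheme described above, modelled on
      the trampoline enter/exit helpers for sleepable programs; simplified and
      not the exact kernel code:

        static void notrace sleepable_prog_enter(void)
        {
            /*
             * Sleepable progs may block, so plain rcu_read_lock() and
             * migrate_disable() cannot be used; RCU-tasks-trace covers the
             * program's lifetime instead.
             */
            rcu_read_lock_trace();
        }

        static void notrace sleepable_prog_exit(void)
        {
            rcu_read_unlock_trace();
            /*
             * Trampoline updates and map-element freeing wait for all such
             * sections via synchronize_rcu_tasks_trace(), plus
             * synchronize_rcu_tasks() for the trampoline assembly itself.
             */
        }
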
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: KP Singh <kpsingh@google.com>
      Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
      1e6c62a8
  22. 20 August 2020, 1 commit
    • bpf: Add kernel module with user mode driver that populates bpffs. · d71fa5c9
      Committed by Alexei Starovoitov
      Add kernel module with user mode driver that populates bpffs with
      BPF iterators.
      
      $ mount bpffs /my/bpffs/ -t bpf
      $ ls -la /my/bpffs/
      total 4
      drwxrwxrwt  2 root root    0 Jul  2 00:27 .
      drwxr-xr-x 19 root root 4096 Jul  2 00:09 ..
      -rw-------  1 root root    0 Jul  2 00:27 maps.debug
      -rw-------  1 root root    0 Jul  2 00:27 progs.debug
      
      The user mode driver will load BPF Type Formats, create BPF maps, populate BPF
      maps, load two BPF programs, attach them to BPF iterators, and finally send two
      bpf_link IDs back to the kernel.
      The kernel will pin two bpf_links into the newly mounted bpffs instance
      under the names "progs.debug" and "maps.debug".  These two files become
      human-readable.
      
      $ cat /my/bpffs/progs.debug
        id name            attached
        11 dump_bpf_map    bpf_iter_bpf_map
        12 dump_bpf_prog   bpf_iter_bpf_prog
        27 test_pkt_access
        32 test_main       test_pkt_access test_pkt_access
        33 test_subprog1   test_pkt_access_subprog1 test_pkt_access
        34 test_subprog2   test_pkt_access_subprog2 test_pkt_access
        35 test_subprog3   test_pkt_access_subprog3 test_pkt_access
        36 new_get_skb_len get_skb_len test_pkt_access
        37 new_get_skb_ifindex get_skb_ifindex test_pkt_access
        38 new_get_constant get_constant test_pkt_access
      
      The BPF program dump_bpf_prog() in iterators.bpf.c prints this data about
      all BPF programs currently loaded in the system.  This information is
      unstable and will change from kernel to kernel, as the ".debug" suffix
      conveys.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200819042759.51280-4-alexei.starovoitov@gmail.com
      d71fa5c9
  23. 19 August 2020, 1 commit
    • uts: Use generic ns_common::count · 9a56493f
      Committed by Kirill Tkhai
      Switch over uts namespaces to use the newly introduced common lifetime
      counter.
      
      Currently every namespace type has its own lifetime counter which is stored
      in the specific namespace struct. The lifetime counters are used
      identically for all namespace types.  Namespaces may of course have
      additional unrelated counters and these are not altered.
      
      This introduces a common lifetime counter into struct ns_common. The
      ns_common struct encompasses information that all namespaces share. That
      should include the lifetime counter since it's common to all of them.
      
      It also allows us to unify the type of the counters across all namespaces.
      Most of them use refcount_t but one uses atomic_t and at least one uses
      kref.  The last one in particular doesn't make much sense, since it has
      just been a wrapper around refcount_t since 2016, and it actually
      complicates cleanup
      operations by having to use container_of() to cast the correct namespace
      struct out of struct ns_common.
      
      Having the lifetime counter for the namespaces in one place reduces
      maintenance cost. Not just because after switching all namespaces over we
      will have removed more code than we added but also because the logic is
      more easily understandable and we indicate to the user that the basic
      lifetime requirements for all namespaces are currently identical.
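
      A minimal sketch of the shared counter and of a typical helper after the
      conversion; the struct layout is abbreviated and the helper body is
      illustrative of the pattern rather than copied from the patch:

        struct ns_common {
            atomic_long_t stashed;
            const struct proc_ns_operations *ops;
            unsigned int inum;
            refcount_t count;   /* common lifetime counter for every ns type */
        };

        /* uts namespaces now bump the shared counter instead of a private kref. */
        static inline void get_uts_ns(struct uts_namespace *ns)
        {
            refcount_inc(&ns->ns.count);
        }
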
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/159644978167.604812.1773586504374412107.stgit@localhost.localdomain
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      9a56493f
  24. 08 August 2020, 2 commits
  25. 05 August 2020, 2 commits
  26. 04 August 2020, 1 commit
    • init: Align init_task to avoid conflict with MUTEX_FLAGS · d0b7213f
      Committed by Stafford Horne
      When booting on 32-bit machines (seen on OpenRISC) I saw this warning
      with CONFIG_DEBUG_MUTEXES turned on.
      
          ------------[ cut here ]------------
          WARNING: CPU: 0 PID: 0 at kernel/locking/mutex.c:1242 __mutex_unlock_slowpath+0x328/0x3ec
          DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
          Modules linked in:
          CPU: 0 PID: 0 Comm: swapper Not tainted 5.8.0-rc1-simple-smp-00005-g2864e2171db4-dirty #179
          Call trace:
          [<(ptrval)>] dump_stack+0x34/0x48
          [<(ptrval)>] __warn+0x104/0x158
          [<(ptrval)>] ? __mutex_unlock_slowpath+0x328/0x3ec
          [<(ptrval)>] warn_slowpath_fmt+0x7c/0x94
          [<(ptrval)>] __mutex_unlock_slowpath+0x328/0x3ec
          [<(ptrval)>] mutex_unlock+0x18/0x28
          [<(ptrval)>] __cpuhp_setup_state_cpuslocked.part.0+0x29c/0x2f4
          [<(ptrval)>] ? page_alloc_cpu_dead+0x0/0x30
          [<(ptrval)>] ? start_kernel+0x0/0x684
          [<(ptrval)>] __cpuhp_setup_state+0x4c/0x5c
          [<(ptrval)>] page_alloc_init+0x34/0x68
          [<(ptrval)>] ? start_kernel+0x1a0/0x684
          [<(ptrval)>] ? early_init_dt_scan_nodes+0x60/0x70
          irq event stamp: 0
      
      I traced this to kernel/locking/mutex.c storing 3 bits of MUTEX_FLAGS in
      the task_struct pointer (mutex.owner).  There is a comment saying that
      task_structs are always aligned to L1_CACHE_BYTES.  This is not true for
      the init_task.
      
      On 64-bit machines this is not a problem because symbol addresses are
      naturally aligned to 64 bits, providing 3 bits for MUTEX_FLAGS.  However,
      on 32-bit machines the symbol address only has 2 bits available.
      
      Fix this by setting init_task alignment to at least L1_CACHE_BYTES.
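
      A sketch of the fix and of the constraint it satisfies; the exact
      init_task definition is abbreviated, and the three flag bits are the
      MUTEX_FLAG_* bits referred to above:

        /*
         * mutex.owner reuses the low bits of the task_struct pointer as flags,
         * so every task_struct (including init_task) must be aligned to at
         * least 8 bytes; a bare 32-bit symbol only guarantees 4.
         */
        #define MUTEX_FLAGS     0x07

        struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
            /* ... existing initializers unchanged ... */
        };
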
      Signed-off-by: Stafford Horne <shorne@gmail.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      d0b7213f
  27. 31 July 2020, 2 commits