1. 09 Sep, 2021 1 commit
  2. 04 Sep, 2021 2 commits
  3. 26 Aug, 2021 1 commit
  4. 17 Aug, 2021 1 commit
  5. 10 Aug, 2021 1 commit
  6. 24 Jul, 2021 3 commits
  7. 30 Jun, 2021 1 commit
    • mm: gup: pack has_pinned in MMF_HAS_PINNED · a458b76a
      Authored by Andrea Arcangeli
      The 32bit has_pinned field can be packed into the MMF_HAS_PINNED bit as
      a noop cleanup.
      
      Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
      would reintroduce a loss of SMP scalability to pin-fast, so there's no
      potential future use in keeping an atomic in the mm for this.
      
      set_bit(MMF_HAS_PINNED) will theoretically be a bit slower than WRITE_ONCE
      (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
      atomic_set after this commit) still has to be issued only once per "mm",
      so the difference between the two will be lost in the noise (the resulting
      helpers are sketched after this entry).
      
      will-it-scale "mmap2" shows no change in performance with enterprise
      config as expected.
      
      will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
      improvement against upstream as expected.
      
      This is a noop as far as overall performance and SMP scalability are
      concerned.
      
      [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
        Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
      [akpm@linux-foundation.org: coding style fixes]
      [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]
      
      Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a458b76a
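      A minimal C sketch of the resulting packing, assuming kernel context.
      MMF_HAS_PINNED and mm_set_has_pinned_flag() are named by the commit; the
      bit value and the helper bodies below are illustrative, not the exact code.

      #include <linux/mm_types.h>   /* struct mm_struct */
      #include <linux/bitops.h>     /* test_bit(), set_bit() */

      /* Illustrative bit number in mm->flags; not necessarily the real one. */
      #define MMF_HAS_PINNED 28

      static inline bool mm_has_pinned(struct mm_struct *mm)
      {
              return test_bit(MMF_HAS_PINNED, &mm->flags);
      }

      static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)
      {
              /*
               * Like the WRITE_ONCE of the old has_pinned field, dirty the
               * shared cacheline at most once per mm: test before setting.
               */
              if (!test_bit(MMF_HAS_PINNED, mm_flags))
                      set_bit(MMF_HAS_PINNED, mm_flags);
      }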
  8. 28 Jun, 2021 1 commit
  9. 24 Jun, 2021 1 commit
  10. 18 Jun, 2021 1 commit
  11. 17 Jun, 2021 1 commit
    • sched/cpufreq: Consider reduced CPU capacity in energy calculation · 8f1b971b
      Authored by Lukasz Luba
      Energy Aware Scheduling (EAS) needs to predict the decisions made by
      SchedUtil; map_util_freq() exists to do that.

      There are corner cases where the max allowed frequency might be reduced
      (due to thermal). SchedUtil, as a CPUFreq governor, is aware of that,
      but EAS is not. This patch aims to address that.
      
      SchedUtil stores the maximum allowed frequency in the
      'sugov_policy::next_freq' field. EAS has to predict that value, which is
      the frequency that is really used. That value is produced by a call to
      cpufreq_driver_resolve_freq(), which clamps to the CPUFreq policy limits.
      In the existing code EAS is not able to predict that real frequency,
      which leads to energy estimation errors.
      
      To avoid wrong energy estimation in EAS (due to frequency misprediction),
      make sure that the step which calculates the Performance Domain frequency
      is also aware of the allowed CPU capacity.

      Furthermore, modify map_util_freq() to not extend the frequency value.
      Instead, use map_util_perf() to extend the util value in both places,
      SchedUtil and EAS, but for EAS clamp it to the max allowed CPU capacity.
      In the end, we achieve the same desirable behavior for both subsystems
      and alignment with regard to the real CPU frequency (the prediction step
      is sketched after this entry).
      Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part)
      Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
      8f1b971b
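      A hedged sketch of the prediction step described above. map_util_perf()
      and map_util_freq() are the helpers named in the commit; the wrapper
      function and its argument names are assumptions for illustration.

      /* Add the usual ~25% headroom to a utilization value. */
      static inline unsigned long map_util_perf(unsigned long util)
      {
              return util + (util >> 2);
      }

      /* Pure util -> frequency mapping; no headroom is added here any more. */
      static inline unsigned long map_util_freq(unsigned long util,
                                                unsigned long freq,
                                                unsigned long cap)
      {
              return freq * util / cap;
      }

      /* Hypothetical helper: predict the frequency SchedUtil would request. */
      static unsigned long eas_predict_freq(unsigned long max_util,
                                            unsigned long max_freq,
                                            unsigned long scale_cpu,
                                            unsigned long allowed_cpu_cap)
      {
              max_util = map_util_perf(max_util);
              /* Honor a (e.g. thermally) reduced maximum CPU capacity. */
              if (max_util > allowed_cpu_cap)
                      max_util = allowed_cpu_cap;
              return map_util_freq(max_util, max_freq, scale_cpu);
      }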
  12. 19 May, 2021 2 commits
  13. 13 May, 2021 3 commits
  14. 12 May, 2021 1 commit
  15. 06 May, 2021 2 commits
    • mm: honor PF_MEMALLOC_PIN for all movable pages · 8e3560d9
      Authored by Pavel Tatashin
      PF_MEMALLOC_PIN is currently only honored for CMA pages. Extend this flag
      to work for any allocation from ZONE_MOVABLE by removing __GFP_MOVABLE
      from gfp_mask when the flag is set in the current context.

      Add is_pinnable_page(), which returns true if a page is pinnable. A
      pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type (both
      pieces are sketched after this entry).
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-8-pasha.tatashin@soleen.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e3560d9
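      A hedged C sketch of the two pieces, assuming kernel context (struct page,
      current, PF_MEMALLOC_PIN and __GFP_MOVABLE from the mm headers). The
      helper names come from the commit; the bodies are a simplified reading.

      /* A page is pinnable unless it sits in ZONE_MOVABLE or a CMA pageblock. */
      static inline bool is_pinnable_page(struct page *page)
      {
              return page_zonenum(page) != ZONE_MOVABLE &&
                     !is_migrate_cma_page(page);
      }

      /* Illustrative gfp adjustment: PF_MEMALLOC_PIN strips __GFP_MOVABLE. */
      static inline gfp_t pin_aware_gfp_context(gfp_t flags)
      {
              if (current->flags & PF_MEMALLOC_PIN)
                      flags &= ~__GFP_MOVABLE;
              return flags;
      }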
    • mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN · 1a08ae36
      Authored by Pavel Tatashin
      PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not
      return pages that might belong to the CMA region.  This is currently used
      for long term gup to make sure that such pins are not going to be done
      on any CMA pages.
      
      When PF_MEMALLOC_NOCMA was introduced we did not realize that it
      focuses too much on CMA pages and that there is a larger class of
      pages that needs the same treatment.  The MOVABLE zone cannot contain
      any long term pins either, so it makes sense to reuse and redefine this
      flag for that use case as well.  Rename the flag to PF_MEMALLOC_PIN,
      which defines an allocation context that can only get pages suitable
      for long-term pins.
      
      Also rename memalloc_nocma_save()/memalloc_nocma_restore() to
      memalloc_pin_save()/memalloc_pin_restore() and make the new functions
      common (their use is sketched after this entry).
      
      [rppt@linux.ibm.com: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN]
        Link: https://lkml.kernel.org/r/20210331163816.11517-1-rppt@kernel.org
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-6-pasha.tatashin@soleen.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a08ae36
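      A hedged sketch of the renamed helpers and of how a long-term pin path
      might bracket its work with them; the function names come from the
      commit, while the bodies and the example caller are illustrative.

      static inline unsigned int memalloc_pin_save(void)
      {
              unsigned int flags = current->flags & PF_MEMALLOC_PIN;

              current->flags |= PF_MEMALLOC_PIN;
              return flags;   /* previous state, so that nesting works */
      }

      static inline void memalloc_pin_restore(unsigned int flags)
      {
              current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
      }

      /* Illustrative caller: a FOLL_LONGTERM-style pin path. */
      static long longterm_pin_example(void)
      {
              unsigned int flags = memalloc_pin_save();
              long ret = 0;

              /* ... pin user pages; allocations now avoid CMA and MOVABLE ... */

              memalloc_pin_restore(flags);
              return ret;
      }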
  16. 01 May, 2021 4 commits
  17. 21 Apr, 2021 1 commit
  18. 16 Apr, 2021 1 commit
  19. 16 Mar, 2021 1 commit
    • fanotify: configurable limits via sysfs · 5b8fea65
      Authored by Amir Goldstein
      fanotify has some hardcoded limits. The only APIs to escape those limits
      are FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS.
      
      Allow finer grained tuning of the system limits via sysfs tunables under
      /proc/sys/fs/fanotify, similar to tunables under /proc/sys/fs/inotify,
      with some minor differences.
      
      - max_queued_events - global system tunable for group queue size limit.
        Like the inotify tunable with the same name, it defaults to 16384 and
        applies on initialization of a new group.
      
      - max_user_marks - user ns tunable for marks limit per user.
        Like the inotify tunable named max_user_watches on a machine with
        sufficient RAM, it defaults to 1048576 in init userns and can be
        further limited per containing user ns.
      
      - max_user_groups - user ns tunable for number of groups per user.
        Like the inotify tunable named max_user_instances, it defaults to 128
        in init userns and can be further limited per containing user ns.
      
      The slightly different tunable names used for fanotify are derived from
      the "group" and "mark" terminology used in the fanotify man pages and
      throughout the code.
      
      Considering the fact that the default value for max_user_instances was
      increased in kernel v5.10 from 8192 to 1048576, leaving the legacy
      fanotify limit of 8192 marks per group in addition to the max_user_marks
      limit makes little sense, so the per group marks limit has been removed.
      
      Note that when a group is initialized with FAN_UNLIMITED_MARKS, its own
      marks are not counted in the per-user marks accounting, so in effect the
      max_user_marks limit only applies to the collection of groups that are
      not initialized with FAN_UNLIMITED_MARKS (a small userspace sketch of
      this opt-out follows this entry).
      
      Link: https://lore.kernel.org/r/20210304112921.3996419-2-amir73il@gmail.com
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      5b8fea65
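      A userspace sketch of the opt-out mentioned above: a group created with
      FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS (which requires
      CAP_SYS_ADMIN) is exempt from the max_queued_events and max_user_marks
      tunables; the program itself is only an example.

      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/fanotify.h>

      int main(void)
      {
              /*
               * Without these flags the group is subject to the sysctls under
               * /proc/sys/fs/fanotify/ described in the commit message.
               */
              int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_UNLIMITED_QUEUE |
                                     FAN_UNLIMITED_MARKS, O_RDONLY);

              if (fd < 0) {
                      perror("fanotify_init");
                      return 1;
              }
              printf("fanotify group fd=%d, exempt from queue/marks limits\n", fd);
              return 0;
      }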
  20. 14 Mar, 2021 1 commit
  21. 05 Mar, 2021 1 commit
    • kernel: provide create_io_thread() helper · cc440e87
      Authored by Jens Axboe
      Provide a generic helper for setting up an io_uring worker. It returns a
      task_struct so that the caller can do whatever setup is needed and then
      call wake_up_new_task() to kick it into gear (a usage sketch follows
      this entry).

      Add a kernel_clone_args member, io_thread, which tells copy_process() to
      mark the task with PF_IO_WORKER.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cc440e87
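      A hedged sketch of how a caller might use the new helper. The commit only
      says create_io_thread() returns a task_struct; the three-argument
      prototype, the worker body and the wrapper below are assumptions.

      #include <linux/err.h>          /* IS_ERR() */
      #include <linux/sched/task.h>   /* wake_up_new_task() */

      static int io_worker_fn(void *data)
      {
              /* ... process io_uring work items until asked to exit ... */
              return 0;
      }

      static struct task_struct *start_io_worker(void *data, int node)
      {
              struct task_struct *tsk;

              tsk = create_io_thread(io_worker_fn, data, node);
              if (IS_ERR(tsk))
                      return tsk;

              /* Caller-specific setup goes here, then kick the PF_IO_WORKER task. */
              wake_up_new_task(tsk);
              return tsk;
      }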
  22. 17 Feb, 2021 2 commits
  23. 16 Dec, 2020 2 commits
    • mm: extract might_alloc() debug check · 95d6c701
      Authored by Daniel Vetter
      Extracted from slab.h, which seems to have the most complete version,
      including the correct might_sleep() check.  Roll it out to slob.c (the
      extracted helper is sketched after this entry).
      
      Motivated by a discussion with Paul about possibly changing call_rcu
      behaviour to allocate memory, but only roughly every 500th call.
      
      There are a lot fewer places in the kernel that care about whether
      allocating memory is allowed or not (due to deadlocks with reclaim code)
      than places that care whether sleeping is allowed.  But debugging these
      also tends to be a lot harder, so nice descriptive checks could come in
      handy.  I might have some use eventually for annotations in drivers/gpu.
      
      Note that, unlike fs_reclaim_acquire/release, gfpflags_allow_blocking()
      does not consult the PF_MEMALLOC flags.  But there is no flag equivalent for
      GFP_NOWAIT, hence this check can't go wrong due to
      memalloc_no*_save/restore contexts.  Willy is working on a patch series
      which might change this:
      
      https://lore.kernel.org/linux-mm/20200625113122.7540-7-willy@infradead.org/
      
      I think best would be if that updates gfpflags_allow_blocking(), since
      there's a ton of callers all over the place for that already.
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-3-daniel.vetter@ffwll.ch
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95d6c701
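      A sketch of the extracted check, assuming kernel context; the body is the
      natural combination of the fs_reclaim lockdep annotations with the
      might_sleep_if() check the text describes.

      /* Annotate a possible allocation site with the given gfp mask. */
      static inline void might_alloc(gfp_t gfp_mask)
      {
              /* Teach lockdep about the reclaim dependency ... */
              fs_reclaim_acquire(gfp_mask);
              fs_reclaim_release(gfp_mask);

              /* ... and complain if the context may not sleep but the mask allows blocking. */
              might_sleep_if(gfpflags_allow_blocking(gfp_mask));
      }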
    • cpufreq: Add special-purpose fast-switching callback for drivers · ee2cc427
      Authored by Rafael J. Wysocki
      First off, some cpufreq drivers (e.g. intel_pstate) can pass hints
      beyond the current target frequency to the hardware, and there are no
      provisions for doing that in the cpufreq framework.  In particular,
      today the driver has to assume that it should not allow the frequency
      to fall below the one requested by the governor (or the required
      capacity may not be provided), which may not be the case and which may
      lead to excessive energy usage in some scenarios.
      
      Second, the hints passed by these drivers to the hardware need not be
      in terms of the frequency, so representing the utilization numbers
      coming from the scheduler as frequency before passing them to those
      drivers is not really useful.
      
      Address the two points above by adding a special-purpose replacement
      for the ->fast_switch callback, called ->adjust_perf, allowing the
      governor to pass abstract performance-level (rather than frequency)
      values for the minimum (required) and target (desired) performance
      along with the CPU capacity to compare them to (the callback's shape
      is sketched after this entry).

      Also update the schedutil governor to use the new callback instead
      of ->fast_switch if it is present and if the utilization metrics are
      frequency-invariant (which is a prerequisite for the direct mapping
      between the utilization and the CPU performance levels to be a
      reasonable approximation).
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      ee2cc427
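      A hedged sketch of the shape of the new callback. The argument meanings
      follow the commit text (minimum and target performance plus the capacity
      to compare them to); the struct and the governor-side caller are
      illustrative, not the real cpufreq types.

      /* Driver-side hook: abstract performance levels instead of a kHz target. */
      struct fast_switch_ops_sketch {
              void (*adjust_perf)(unsigned int cpu,
                                  unsigned long min_perf,
                                  unsigned long target_perf,
                                  unsigned long capacity);
      };

      /* Governor side: prefer ->adjust_perf when the driver provides it. */
      static void governor_update_sketch(struct fast_switch_ops_sketch *ops,
                                         unsigned int cpu, unsigned long util,
                                         unsigned long min, unsigned long max)
      {
              if (ops->adjust_perf)
                      ops->adjust_perf(cpu, min, util, max);
              /* else fall back to the frequency-based ->fast_switch path */
      }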
  24. 13 Dec, 2020 2 commits
  25. 11 Dec, 2020 1 commit
    • exec: Transform exec_update_mutex into a rw_semaphore · f7cfd871
      Authored by Eric W. Biederman
      Recently syzbot reported[0] that there is a deadlock amongst the users
      of exec_update_mutex.  The problematic lock ordering found by lockdep
      was:
      
         perf_event_open  (exec_update_mutex -> ovl_i_mutex)
         chown            (ovl_i_mutex       -> sb_writes)
         sendfile         (sb_writes         -> p->lock)
           by reading from a proc file and writing to overlayfs
         proc_pid_syscall (p->lock           -> exec_update_mutex)
      
      While looking at possible solutions it occurred to me that all of the
      users and possible users involved only wanted the state of the given
      process to remain the same.  They are all readers.  The only writer is
      exec.
      
      There is no reason for readers to block on each other.  So fix
      this deadlock by transforming exec_update_mutex into a rw_semaphore
      named exec_update_lock that only exec takes for writing (the resulting
      locking pattern is sketched after this entry).
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Fixes: eea96732 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
      [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
      Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
      Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      f7cfd871
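      A hedged sketch of the resulting locking pattern. exec_update_lock is the
      name from the commit; its placement in signal_struct and the surrounding
      functions are assumptions for illustration.

      #include <linux/rwsem.h>
      #include <linux/sched/signal.h>

      /* Readers (perf_event_open, /proc handlers, ...) can now run in parallel. */
      static int reader_sketch(struct task_struct *task)
      {
              if (down_read_killable(&task->signal->exec_update_lock))
                      return -EINTR;   /* interrupted while waiting */

              /* ... inspect the task's creds/mm; exec cannot change them here ... */

              up_read(&task->signal->exec_update_lock);
              return 0;
      }

      /* Only exec takes the lock for writing while it installs the new image. */
      static void exec_sketch(struct task_struct *task)
      {
              down_write(&task->signal->exec_update_lock);
              /* ... swap in the new mm and credentials ... */
              up_write(&task->signal->exec_update_lock);
      }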
  26. 19 Nov, 2020 1 commit
  27. 11 Nov, 2020 1 commit