1. 24 Nov 2022, 1 commit
  2. 26 Jul 2022, 1 commit
  3. 09 Apr 2021, 1 commit
  4. 19 Oct 2020, 2 commits
    • mm: kmem: prepare remote memcg charging infra for interrupt contexts · 37d5985c
      Authored by Roman Gushchin
      Remote memcg charging API uses current->active_memcg to store the
      currently active memory cgroup, which overwrites the memory cgroup of the
      current process.  It works well for normal contexts, but doesn't work for
      interrupt contexts: indeed, if an interrupt occurs during the execution of
      a section with an active memcg set, all allocations inside the interrupt
      will be charged to the active memcg set (given that we'll enable
      accounting for allocations from an interrupt context).  But because the
      interrupt might have no relation to the active memcg set outside, it's
      obviously wrong from the accounting perspective.
      
      To resolve this problem, let's add a global percpu int_active_memcg
      variable, which will be used to store an active memory cgroup which will
      be used from interrupt contexts.  set_active_memcg() will transparently
      use current->active_memcg or int_active_memcg depending on the context.
      
      To make the read part simple and transparent for the caller, let's
      introduce two new functions:
        - struct mem_cgroup *active_memcg(void),
        - struct mem_cgroup *get_active_memcg(void).
      
      They return the active memcg if one is set, hiding the implementation
      detail of where it is stored for the current context.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37d5985c
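      A minimal sketch, assuming the behaviour described above, of how the
      context-dependent lookup could look (the in-tree helpers may differ in
      detail):

        /* Sketch only: pick the interrupt-context slot when in interrupt,
         * otherwise fall back to the per-task pointer. */
        static __always_inline struct mem_cgroup *active_memcg(void)
        {
                if (in_interrupt())
                        return this_cpu_read(int_active_memcg);
                else
                        return current->active_memcg;
        }

      get_active_memcg() would additionally take a reference on the returned
      css under rcu_read_lock() before handing it out.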
    • mm, memcg: rework remote charging API to support nesting · b87d8cef
      Authored by Roman Gushchin
      Currently the remote memcg charging API consists of two functions:
      memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear
      the memcg value that overwrites the memcg of the current task.
      
        memalloc_use_memcg(target_memcg);
        <...>
        memalloc_unuse_memcg();
      
      It works perfectly for allocations performed from a normal context;
      however, calling it from an interrupt context or simply nesting two
      remote charging blocks leads to incorrect accounting: on exit from
      the inner block the active memcg is cleared instead of being
      restored.
      
        memalloc_use_memcg(target_memcg);
      
        memalloc_use_memcg(target_memcg_2);
          <...>
          memalloc_unuse_memcg();
      
          Error: allocations here are charged to the memcg of the current
          process instead of target_memcg.
      
        memalloc_unuse_memcg();
      
      This patch extends the remote charging API by switching to a single
      function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
      which sets the new value and returns the old one.  So a remote charging
      block will look like:
      
        old_memcg = set_active_memcg(target_memcg);
        <...>
        set_active_memcg(old_memcg);
      
      This patch is heavily based on the patch by Johannes Weiner, which can be
      found here: https://lkml.org/lkml/2020/5/28/806 .
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Schatzberg <dschatzberg@fb.com>
      Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b87d8cef
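      A sketch of the nesting-safe helper described above, assuming the shape
      implied by the usage example (the later interrupt-context rework extends
      this to cover int_active_memcg as well):

        static inline struct mem_cgroup *
        set_active_memcg(struct mem_cgroup *memcg)
        {
                struct mem_cgroup *old = current->active_memcg;

                current->active_memcg = memcg;
                return old;     /* the caller restores this on exit */
        }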
  5. 17 Oct 2020, 1 commit
  6. 25 Sep 2020, 1 commit
  7. 13 Aug 2020, 1 commit
    • include/linux/sched/mm.h: optimize current_gfp_context() · af161bee
      Authored by Waiman Long
      The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
      flags into the corresponding GFP_* flags for memory allocation.  In that
      function, current->flags is accessed 3 times.  That may lead to duplicated
      access of the same memory location.
      
      This is usually not a problem with minimal debug config options enabled,
      as the compiler can optimize away the duplicated memory accesses.  With
      most of the debug config options enabled, however, that may not be the case.  For
      example, the x86-64 object size of the __need_fs_reclaim() in a debug
      kernel that calls current_gfp_context() was 309 bytes.  With this patch
      applied, the object size is reduced to 202 bytes.  This is a saving of 107
      bytes and will probably be slightly faster too.
      
      Use READ_ONCE() to access current->flags to prevent the compiler from
      possibly accessing current->flags multiple times.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Link: http://lkml.kernel.org/r/20200618212936.9776-1-longman@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af161bee
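      An illustrative sketch of the optimized helper, assuming the
      PF_*/__GFP_* handling described above; the point is the single
      READ_ONCE() snapshot of current->flags:

        static inline gfp_t current_gfp_context(gfp_t flags)
        {
                unsigned int pflags = READ_ONCE(current->flags); /* one access */

                if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
                        if (pflags & PF_MEMALLOC_NOIO)
                                flags &= ~(__GFP_IO | __GFP_FS);
                        else if (pflags & PF_MEMALLOC_NOFS)
                                flags &= ~__GFP_FS;
                }
                return flags;
        }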
  8. 08 Aug 2020, 1 commit
    • mm/page_alloc: fix memalloc_nocma_{save/restore} APIs · 8510e69c
      Authored by Joonsoo Kim
      Currently, the memalloc_nocma_{save/restore} API, which prevents the CMA
      area from being used in page allocation, is implemented using
      current_gfp_context(). However, there are two problems with this
      implementation.
      
      First, this doesn't work for the allocation fastpath. In the fastpath,
      the original gfp_mask is used, since current_gfp_context() was introduced
      to control reclaim and is only applied on the slowpath. So, CMA memory
      can be allocated through the allocation fastpath even if the
      memalloc_nocma_{save/restore} APIs are used. Currently, there is just
      one user of these APIs and it has a fallback method to prevent an actual
      problem.
      Second, clearing __GFP_MOVABLE in current_gfp_context() has the side
      effect of excluding ZONE_MOVABLE memory as an allocation target.
      
      To fix these problems, this patch changes how the CMA area is excluded
      during page allocation. The main point of this change is the use of
      alloc_flags: alloc_flags is mainly used to control allocation, so it is
      a good fit for excluding the CMA area from allocation.
      
      Fixes: d7fefcc8 (mm/cma: add PF flag to force non cma alloc)
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8510e69c
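      A sketch of the alloc_flags-based approach described above (assumed
      shape; the real check lives in mm/page_alloc.c and is consulted on both
      the fastpath and the slowpath):

        static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
                                                       unsigned int alloc_flags)
        {
        #ifdef CONFIG_CMA
                /* Only allow CMA pageblocks when the scope does not forbid
                 * them and the allocation is movable. */
                if (!(current->flags & PF_MEMALLOC_NOCMA) &&
                    gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
                        alloc_flags |= ALLOC_CMA;
        #endif
                return alloc_flags;
        }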
  9. 22 Jul 2020, 1 commit
  10. 25 Jun 2020, 1 commit
  11. 10 Jun 2020, 1 commit
  12. 01 May 2020, 1 commit
    • sched/core: Fix illegal RCU from offline CPUs · bf2c59fc
      Authored by Peter Zijlstra
      In the CPU-offline process, it calls mmdrop() after idle entry and the
      subsequent call to cpuhp_report_idle_dead(). Once execution passes the
      call to rcu_report_dead(), RCU is ignoring the CPU, which results in
      lockdep complaining when mmdrop() uses RCU from either memcg or
      debugobjects below.
      
      Fix it by cleaning up the active_mm state from BP instead. Every arch
      which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
      from AP. The only exception is parisc because it switches them to
      &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
      but the patch will still work there because it calls mmgrab(&init_mm) in
      smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
      
        WARNING: suspicious RCU usage
        -----------------------------
        kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
      
        other info that might help us debug this:
      
        RCU used illegally from offline CPU!
        Call Trace:
         dump_stack+0xf4/0x164 (unreliable)
         lockdep_rcu_suspicious+0x140/0x164
         get_work_pool+0x110/0x150
         __queue_work+0x1bc/0xca0
         queue_work_on+0x114/0x120
         css_release+0x9c/0xc0
         percpu_ref_put_many+0x204/0x230
         free_pcp_prepare+0x264/0x570
         free_unref_page+0x38/0xf0
         __mmdrop+0x21c/0x2c0
         idle_task_exit+0x170/0x1b0
         pnv_smp_cpu_kill_self+0x38/0x2e0
         cpu_die+0x48/0x64
         arch_cpu_idle_dead+0x30/0x50
         do_idle+0x2f4/0x470
         cpu_startup_entry+0x38/0x40
         start_secondary+0x7a8/0xa80
         start_secondary_resume+0x10/0x14
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
      bf2c59fc
  13. 20 Nov 2019, 1 commit
  14. 25 Sep 2019, 2 commits
    • sched/membarrier: Fix p->mm->membarrier_state racy load · 227a4aad
      Authored by Mathieu Desnoyers
      The membarrier_state field is located within the mm_struct, which
      is not guaranteed to exist when used from runqueue-lock-free iteration
      on runqueues by the membarrier system call.
      
      Copy the membarrier_state from the mm_struct into the scheduler runqueue
      when the scheduler switches between mm.
      
      When registering membarrier for mm, after setting the registration bit
      in the mm membarrier state, issue a synchronize_rcu() to ensure the
      scheduler observes the change.  In order to take care of the case
      where a runqueue keeps executing the target mm without switching to
      another mm, iterate over each runqueue and issue an IPI to copy the
      membarrier_state from the mm_struct into each runqueue that runs the
      mm whose state has just been modified.
      
      Move the mm membarrier_state field closer to pgd in mm_struct to use
      a cache line already touched by the scheduler switch_mm.
      
      The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
      clear the runqueue's membarrier state in addition to clearing the mm
      membarrier state, so its implementation is moved into the scheduler
      membarrier code so it can access the runqueue structure.
      
      Add memory barrier in membarrier_exec_mmap() prior to clearing
      the membarrier state, ensuring memory accesses executed prior to exec
      are not reordered with the stores clearing the membarrier state.
      
      As suggested by Linus, move all membarrier.c RCU read-side locks outside
      of the for each cpu loops.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      227a4aad
    • sched/membarrier: Call sync_core only before usermode for same mm · 2840cf02
      Authored by Mathieu Desnoyers
      When the prev and next task's mm change, switch_mm() provides the core
      serializing guarantees before returning to usermode. The only case
      where an explicit core serialization is needed is when the scheduler
      keeps the same mm for prev and next.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-4-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2840cf02
  15. 14 Jun 2019, 1 commit
    • coredump: fix race condition between collapse_huge_page() and core dumping · 59ea6d06
      Authored by Andrea Arcangeli
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      
      collapse_huge_page() after taking the mmap_sem for writing doesn't
      modify any vma, so it's not obvious that it could cause a problem for the
      coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults that
      call pmd_trans_huge_lock().
      
      Specifically the invariant that "!pmd_trans_huge()" cannot become a
      "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
      
      The coredump will call __get_user_pages() without mmap_sem for reading,
      which eventually can invoke a lockless page fault which will need a
      functional pmd_trans_huge_lock().
      
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
      
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59ea6d06
  16. 20 Apr 2019, 1 commit
    • coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping · 04f5866e
      Authored by Andrea Arcangeli
      The core dumping code has always run without holding the mmap_sem for
      writing, despite that being the only way to ensure that the entire vma
      layout will not change from under it.  Only using some signal
      serialization on the processes belonging to the mm is not nearly enough.
      This was pointed out earlier.  For example in Hugh's post from Jul 2017:
      
        https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
      
        "Not strictly relevant here, but a related note: I was very surprised
         to discover, only quite recently, how handle_mm_fault() may be called
         without down_read(mmap_sem) - when core dumping. That seems a
         misguided optimization to me, which would also be nice to correct"
      
      In particular, because growsdown and growsup can move vm_start/vm_end,
      the various loops the core dump does over the vmas will not be
      consistent if page faults can happen concurrently.
      
      Pretty much all users calling mmget_not_zero()/get_task_mm() and then
      taking the mmap_sem had the potential to introduce unexpected side
      effects in the core dumping code.
      
      Adding mmap_sem for writing around the ->core_dump invocation is a
      viable long term fix, but it requires removing all copy user and page
      faults and to replace them with get_dump_page() for all binary formats
      which is not suitable as a short term fix.
      
      For the time being this solution manually covers the places that can
      confuse the core dump either by altering the vma layout or the vma flags
      while it runs.  Once ->core_dump runs under mmap_sem for writing the
      function mmget_still_valid() can be dropped.
      
      Allowing mmap_sem protected sections to run in parallel with the
      coredump provides some minor parallelism advantage to the swapoff code
      (which seems to be safe enough by never mangling any vma field and can
      keep doing swapins in parallel to the core dumping) and to some other
      corner case.
      
      In order to facilitate the backporting I added "Fixes: 86039bd3"
      however the side effect of this same race condition in /proc/pid/mem
      should be reproducible since before 2.6.12-rc2 so I couldn't add any
      other "Fixes:" because there's no hash beyond the git genesis commit.
      
      Because find_extend_vma() is the only location outside of the process
      context that could modify the "mm" structures under mmap_sem for
      reading, by adding the mmget_still_valid() check to it, all other cases
      that take the mmap_sem for reading don't need the new check after
      mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
      context also doesn't need the new check, because all tasks under core
      dumping are frozen.
      
      Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
      Fixes: 86039bd3 ("userfaultfd: add new syscall to provide memory externalization")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Jann Horn <jannh@google.com>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jann Horn <jannh@google.com>
      Acked-by: Jason Gunthorpe <jgg@mellanox.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04f5866e
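      A minimal sketch of the helper this commit introduces, assuming the core
      dump publishes its progress in mm->core_state while it runs:

        static inline bool mmget_still_valid(struct mm_struct *mm)
        {
                /* Bail out if a core dump is in progress on this mm, so we
                 * never change the vma layout or flags under the dumper. */
                return likely(!mm->core_state);
        }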
  17. 06 Mar 2019, 1 commit
    • mm/cma: add PF flag to force non cma alloc · d7fefcc8
      Authored by Aneesh Kumar K.V
      Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA
      region", v8.
      
      ppc64 uses the CMA area for the allocation of guest page table (hash
      page table).  We won't be able to start the guest if we fail to allocate
      the hash page table.  We have observed hash table allocation failures
      because we failed to migrate pages out of the CMA region, as they were
      pinned.  This happens when we are using VFIO.  VFIO on ppc64 pins the
      entire guest
      RAM.  If the guest RAM pages get allocated out of CMA region, we won't
      be able to migrate those pages.  The pages are also pinned for the
      lifetime of the guest.
      
      Currently we support migration of non-compound pages.  With THP and with
      the addition of hugetlb migration we can end up allocating compound
      pages from the CMA region.  This patch series adds support for migrating
      compound pages.
      
      This patch (of 4):
      
      Add PF_MEMALLOC_NOCMA, which makes sure any allocation in that context is
      marked non-movable and hence cannot be satisfied by the CMA region.
      
      This is useful with get_user_pages_longterm where we want to take a page
      pin by migrating pages from CMA region.  Marking the section
      PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration
      later.
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7fefcc8
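      A sketch of how a long-term pinning path might use the new scope flag
      (illustrative caller; pin_pages_longterm() is a made-up name):

        unsigned int flags;

        flags = memalloc_nocma_save();        /* allocations below avoid CMA */
        ret = pin_pages_longterm(pages, nr);  /* hypothetical helper */
        memalloc_nocma_restore(flags);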
  18. 03 Dec 2018, 1 commit
    • sched: Fix various typos in comments · dfcb245e
      Authored by Ingo Molnar
      Go over the scheduler source code and fix common typos
      in comments - and a typo in an actual variable name.
      
      No change in functionality intended.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dfcb245e
  19. 18 Aug 2018, 1 commit
    • fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Authored by Shakeel Butt
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to a memcg
      different from the current process's memcg.  This patch series
      contains two such concrete use-cases, i.e. fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no or slow listener.  The events
      are allocated in the context of the event producer.  However they should
      be charged to the event consumer.  Similarly the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces a mechanism to charge
      kernel memory to a given memcg.  In case of fsnotify events, the memcg
      of the consumer can be used for charging and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for the huge or
      unlimited queues if there is either no or slow listener.  This can cause
      system level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happen in the context of a syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch also rearranges the members of fsnotify_group, filling holes,
      to keep its size the same, at least for a 64-bit build, even with the
      additional member.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
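      A sketch of the directed-charging pattern described above, assuming the
      listener's memcg is saved in the fsnotify group (field and cache names
      are illustrative):

        memalloc_use_memcg(group->memcg);   /* charge to the listener */
        event = kmem_cache_alloc(event_cachep, GFP_KERNEL_ACCOUNT);
        memalloc_unuse_memcg();             /* back to current's own memcg */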
  20. 08 Jun 2018, 1 commit
  21. 29 May 2018, 1 commit
  22. 17 Apr 2018, 1 commit
  23. 12 Apr 2018, 1 commit
    • exec: pass stack rlimit into mm layout functions · 8f2af155
      Authored by Kees Cook
      Patch series "exec: Pin stack limit during exec".
      
      Attempts to solve problems with the stack limit changing during exec
      continue to be frustrated[1][2].  In addition to the specific issues
      around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
      other places during exec where the stack limit is used and is assumed to
      be unchanging.  Given the many places it gets used and the fact that it
      can be manipulated/raced via setrlimit() and prlimit(), I think the only
      way to handle this is to move away from the "current" view of the stack
      limit and instead attach it to the bprm, and plumb this down into the
      functions that need to know the stack limits.  This series implements
      the approach.
      
      [1] 04e35f44 ("exec: avoid RLIMIT_STACK races with prlimit()")
      [2] 779f4e1c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
      [3] to security@kernel.org, "Subject: existing rlimit races?"
      
      This patch (of 3):
      
      Since it is possible that the stack rlimit can change externally during
      exec (either via another thread calling setrlimit() or another process
      calling prlimit()), provide a way to pass the rlimit down into the
      per-architecture mm layout functions so that the rlimit can stay in the
      bprm structure instead of sitting in the signal structure until exec is
      finalized.
      
      Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f2af155
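      Roughly, the plumbing described above lets the layout hook take the
      snapshotted limit instead of re-reading it from current (signatures
      shown as an assumed sketch):

        /* before: reads the stack rlimit from current internally */
        void arch_pick_mmap_layout(struct mm_struct *mm);

        /* after: the caller passes the rlimit captured early in exec, so a
         * concurrent setrlimit()/prlimit() cannot change it mid-flight */
        void arch_pick_mmap_layout(struct mm_struct *mm,
                                   struct rlimit *rlim_stack);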
  24. 22 Feb 2018, 1 commit
  25. 06 Feb 2018, 4 commits
    • membarrier: Provide core serializing command, *_SYNC_CORE · 70216e18
      Authored by Mathieu Desnoyers
      Provide core serializing membarrier command to support memory reclaim
      by JIT.
      
      Each architecture needs to explicitly opt into that support by
      documenting in their architecture code how they provide the core
      serializing instructions required when returning from the membarrier
      IPI, and after the scheduler has updated the curr->mm pointer (before
      going back to user-space). They should then select
      ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
      their architecture.
      
      Architectures selecting this feature need to either document that
      they issue core serializing instructions when returning to user-space,
      or implement their architecture-specific sync_core_before_usermode().
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-9-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      70216e18
    • membarrier: Provide GLOBAL_EXPEDITED command · c5f58bd5
      Authored by Mathieu Desnoyers
      Allow expedited membarrier to be used for data shared between processes
      through shared memory.
      
      Processes wishing to receive the membarriers register with
      MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. Those which want to issue
      membarrier invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED.
      
      This allows extremely simple kernel-level implementation: we have almost
      everything we need with the PRIVATE_EXPEDITED barrier code. All we need
      to do is to add a flag in the mm_struct that will be used to check
      whether we need to send the IPI to the current thread of each CPU.
      
      There is a slight downside to this approach compared to targeting
      specific shared memory users: when performing a membarrier operation,
      all registered "global" receivers will get the barrier, even if they
      don't share a memory mapping with the sender issuing
      MEMBARRIER_CMD_GLOBAL_EXPEDITED.
      
      This registration approach seems to fit the requirement of not
      disturbing processes that really deeply care about real-time: they
      simply should not register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
      
      In order to align the membarrier command names, the "MEMBARRIER_CMD_SHARED"
      command is renamed to "MEMBARRIER_CMD_GLOBAL", keeping an alias of
      MEMBARRIER_CMD_SHARED to MEMBARRIER_CMD_GLOBAL for UAPI header backward
      compatibility.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-5-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c5f58bd5
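      A minimal user-space sketch of the register/issue pattern described
      above (error handling omitted; requires a kernel providing these
      commands):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int membarrier(int cmd, int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        int main(void)
        {
                /* a receiver opts in once ... */
                membarrier(MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, 0);
                /* ... any process may then issue the global barrier */
                membarrier(MEMBARRIER_CMD_GLOBAL_EXPEDITED, 0);
                return 0;
        }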
    • membarrier: Document scheduler barrier requirements · 306e0604
      Authored by Mathieu Desnoyers
      Document the membarrier requirement on having a full memory barrier in
      __schedule() after coming from user-space, before storing to rq->curr.
      It is provided by smp_mb__after_spinlock() in __schedule().
      
      Document that membarrier requires a full barrier on transition from
      kernel thread to userspace thread. We currently have an implicit barrier
      from atomic_dec_and_test() in mmdrop() that ensures this.
      
      The x86 switch_mm_irqs_off() full barrier is currently provided by many
      cpumask update operations as well as write_cr3(). Document that
      write_cr3() provides this barrier.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-4-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      306e0604
    • powerpc, membarrier: Skip memory barrier in switch_mm() · 3ccfebed
      Authored by Mathieu Desnoyers
      Allow PowerPC to skip the full memory barrier in switch_mm(), and
      only issue the barrier when scheduling into a task belonging to a
      process that has registered to use expedited private.
      
      Threads targeting the same VM but belonging to different thread
      groups are a tricky case. This has a few consequences:
      
      It turns out that we cannot rely on get_nr_threads(p) to count the
      number of threads using a VM. We can use
      (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
      instead to skip the synchronize_sched() for cases where the VM only has
      a single user, and that user only has a single thread.
      
      It also turns out that we cannot use for_each_thread() to set
      thread flags in all threads using a VM, as it only iterates on the
      thread group.
      
      Therefore, test the membarrier state variable directly rather than
      relying on thread flags. This means
      membarrier_register_private_expedited() needs to set the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
      only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
      private expedited membarrier commands to succeed.
      membarrier_arch_switch_mm() now tests for the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3ccfebed
  26. 01 Feb 2018, 1 commit
  27. 02 Nov 2017, 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Authored by Greg Kroah-Hartman
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  28. 20 Oct 2017, 1 commit
    • membarrier: Provide register expedited private command · a961e409
      Authored by Mathieu Desnoyers
      This introduces a "register private expedited" membarrier command which
      allows eventual removal of important memory barrier constraints on the
      scheduler fast-paths. It changes how the "private expedited" membarrier
      command (new to 4.14) is used from user-space.
      
      This new command allows processes to register their intent to use the
      private expedited command.  This affects how the expedited private
      command introduced in 4.14-rc is meant to be used, and should be merged
      before 4.14 final.
      
      Processes are now required to register before using
      MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
      
      This fixes a problem that arose when designing requested extensions to
      sys_membarrier() to allow JITs to efficiently flush old code from
      instruction caches.  Several potential algorithms are much less painful
      if users register their intent to use this functionality early on, for
      example before the process spawns a second thread.  Registering at
      this time removes the need to interrupt each and every thread in that
      process at the first expedited sys_membarrier() system call.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a961e409
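      A user-space sketch of the registration requirement described above:
      the expedited private command fails with EPERM until the process has
      registered (assumes <linux/membarrier.h>, <sys/syscall.h>, <unistd.h>):

        int ret;

        ret = syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
        /* ret == -1, errno == EPERM: not registered yet */

        syscall(__NR_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);

        ret = syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
        /* succeeds now that the process has registered its intent */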
  29. 04 Oct 2017, 1 commit
  30. 07 Sep 2017, 1 commit
  31. 10 Aug 2017, 1 commit
    • locking/lockdep: Rework FS_RECLAIM annotation · d92a8cfc
      Authored by Peter Zijlstra
      A while ago someone, and I cannot find the email just now, asked if we
      could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
      like we use for other things like workqueues etc. I think this should
      be possible which allows reducing the 'irq' states and will reduce the
      amount of __bfs() lookups we do.
      
      Removing the 1 IRQ state results in 4 less __bfs() walks per
      dependency, improving lockdep performance. And by moving this
      annotation out of the lockdep code it becomes easier for the mm people
      to extend.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: iamjoonsoo.kim@lge.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d92a8cfc
  32. 09 May 2017, 1 commit
  33. 04 May 2017, 1 commit
    • mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Authored by Michal Hocko
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent from deadlocks when the lock held by the allocation
         context would be needed during the memory reclaim
      
       - to prevent from stack overflows during the reclaim because the
         allocation is performed from a deep context already
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to make a forward progress indirectly
      
       - just in case because this would be safe from the fs POV
      
       - silence lockdep false positives
      
      Unfortunately overuse of this allocation context brings some problems to
      the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), and the OOM killer cannot be invoked because the MM layer
      doesn't have enough information about how much memory is freeable by the
      FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain, because there are
      far fewer problematic contexts than specific allocation requests, it
      also helps code paths where the FS layer interacts with other layers (e.g.
      crypto, security modules, MM etc...) and there is no easy way to convey
      the allocation context between the layers.
      
      Introduce memalloc_nofs_{save,restore} API to control the scope of
      GFP_NOFS allocation context.  This is basically copying
      memalloc_noio_{save,restore} API we have for other restricted allocation
      context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
      just an alias for PF_FSTRANS which has been xfs specific until recently.
      There are no PF_FSTRANS users anymore, so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
      is renamed to current_gfp_context because it now cares about both
      PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs code paths preserve
      their semantic.  kmem_flags_convert() doesn't need to evaluate the flag
      anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use a properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7dea19f9
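      A sketch of the scope API usage described above: every allocation inside
      the section implicitly behaves as GFP_NOFS, whatever gfp mask the callee
      passes (the fs-side caller is illustrative):

        unsigned int nofs_flags;

        nofs_flags = memalloc_nofs_save();
        /* e.g. deep inside a transaction: treated as GFP_NOFS here */
        buf = kmalloc(size, GFP_KERNEL);
        memalloc_nofs_restore(nofs_flags);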
  34. 03 Mar 2017, 2 commits