1. 28 January 2020 (1 commit)
  2. 25 January 2020 (1 commit)
  3. 23 January 2020 (1 commit)
  4. 21 January 2020 (1 commit)
  5. 20 January 2020 (1 commit)
  6. 14 January 2020 (14 commits)
    • mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier · 5292e24a
      By Jason Gunthorpe
      The 'interval_sub' is placed on the 'notifier_subscriptions' interval
      tree.
      
      This eliminates the poor name 'mni' for this variable.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifier · 1991722a
      By Jason Gunthorpe
      The 'subscription' is placed on the 'notifier_subscriptions' list.
      
      This eliminates the poor name 'mn' for this variable.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions · 984cfe4e
      By Jason Gunthorpe
      The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
      and is difficult to abbreviate. The struct is actually holding the
      interval tree and hlist containing the notifiers subscribed to a mm.
      
      Use 'subscriptions' as the variable name for this struct instead of the
      really terrible and misleading 'mmn_mm'.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • x86/vdso: Handle faults on timens page · af34ebeb
      By Dmitry Safonov
      If a task belongs to a time namespace then the VVAR page which contains
      the system wide VDSO data is replaced with a namespace specific page
      which has the same layout as the VVAR page.
      Co-developed-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20191112012724.250792-25-dima@arista.com
    • mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid · 2fe20210
      By Adrian Huang
      When booting with amd_iommu=off, the following WARNING message
      appears:
      
        AMD-Vi: AMD IOMMU disabled on kernel command-line
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
        Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
        RIP: 0010:flush_workqueue+0x42e/0x450
        Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
        Call Trace:
         kmem_cache_destroy+0x69/0x260
         iommu_go_to_state+0x40c/0x5ab
         amd_iommu_prepare+0x16/0x2a
         irq_remapping_prepare+0x36/0x5f
         enable_IR_x2apic+0x21/0x172
         default_setup_apic_routing+0x12/0x6f
         apic_intr_mode_init+0x1a1/0x1f1
         x86_late_time_init+0x17/0x1c
         start_kernel+0x480/0x53f
         secondary_startup_64+0xb6/0xc0
        ---[ end trace 30894107c3749449 ]---
        x2apic: IRQ remapping doesn't support X2APIC mode
        x2apic disabled
      
      The warning is caused by the call to kmem_cache_destroy() in
      free_iommu_resources().  Here is the call path:
      
        free_iommu_resources
          kmem_cache_destroy
            flush_memcg_workqueue
              flush_workqueue
      
      The root cause is that the IOMMU subsystem runs before the workqueue
      subsystem, at which point the variable 'wq_online' is still 'false'.
      As a result, the 'if (WARN_ON(!wq_online))' check in flush_workqueue()
      fires.
      
      Since the variable 'memcg_kmem_cache_wq' has not been allocated at that
      point, it is unnecessary to call flush_memcg_workqueue().  This prevents
      the WARNING message triggered by flush_workqueue().
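      
      As a rough userspace analogue of the guard this patch adds (the pointer
      and function names echo the commit text; the real kernel code differs):
      
        #include <stdio.h>
      
        struct workqueue { const char *name; };
      
        /* NULL until the workqueue subsystem has initialized it */
        static struct workqueue *memcg_kmem_cache_wq;
      
        static void flush_memcg_workqueue(void)
        {
                /* The fix: bail out if the workqueue was never allocated,
                 * instead of calling flush_workqueue() on it. */
                if (!memcg_kmem_cache_wq)
                        return;
                printf("flushing %s\n", memcg_kmem_cache_wq->name);
        }
      
        int main(void)
        {
                flush_memcg_workqueue();  /* early boot: no-op, no WARN */
                return 0;
        }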
      
      Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
      Fixes: 92ee383f ("mm: fix race between kmem_cache destroy, create and deactivate")
      Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
      Reported-by: Xiaochun Lee <lixc17@lenovo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: improve arithmetic divisions · 0a5d1a7f
      By Wen Yang
      Use div64_ul() instead of do_div() if the divisor is unsigned long, to
      avoid truncation to 32-bit on 64-bit platforms.
      
      Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com
      Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide · d3ac946e
      By Wen Yang
      The two variables 'numerator' and 'denominator' are declared as long,
      but they should actually be unsigned long (according to the
      implementation of the fprop_fraction_percpu() function).
      
      And do_div() does a 64-by-32 division, while the divisor 'denominator'
      is unsigned long, and thus 64-bit on 64-bit platforms.  Hence the proper
      function to call is div64_ul().
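      
      A minimal, runnable illustration of the truncation, in plain userspace C
      rather than the kernel helpers; the u32 cast stands in for do_div()'s
      32-bit divisor (assumes a 64-bit platform):
      
        #include <stdint.h>
        #include <stdio.h>
      
        int main(void)
        {
                uint64_t num = 1000;
                unsigned long denominator = 0x100000002UL;  /* > 32 bits */
      
                /* do_div() takes a u32 divisor: the value above silently
                 * truncates to 2, giving a bogus quotient. */
                uint32_t as_u32 = (uint32_t)denominator;
                printf("truncated divisor %u -> quotient %llu (wrong)\n",
                       as_u32, (unsigned long long)(num / as_u32));
      
                /* div64_ul() keeps the full unsigned long divisor. */
                printf("full divisor -> quotient %llu (right)\n",
                       (unsigned long long)(num / denominator));
                return 0;
        }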
      
      Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com
      Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio() · 6d9e8c65
      By Wen Yang
      Patch series "use div64_ul() instead of div_u64() if the divisor is
      unsigned long".
      
      We were first inspired by commit b0ab99e7 ("sched: Fix possible divide
      by zero in avg_atom() calculation"); then, looking at the recently
      analyzed mm code, we found this suspicious place.
      
       201                 if (min) {
       202                         min *= this_bw;
       203                         do_div(min, tot_bw);
       204                 }
      
      And we also disassembled and confirmed it:
      
        /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
        0xffffffff811c37da <__wb_calc_thresh+234>:      xor    %r10d,%r10d
        0xffffffff811c37dd <__wb_calc_thresh+237>:      test   %rax,%rax
        0xffffffff811c37e0 <__wb_calc_thresh+240>:      je 0xffffffff811c3800 <__wb_calc_thresh+272>
        /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
        0xffffffff811c37e2 <__wb_calc_thresh+242>:      imul   %r8,%rax
        /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
        0xffffffff811c37e6 <__wb_calc_thresh+246>:      mov    %r9d,%r10d    ---> truncates it to 32 bits here
        0xffffffff811c37e9 <__wb_calc_thresh+249>:      xor    %edx,%edx
        0xffffffff811c37eb <__wb_calc_thresh+251>:      div    %r10
        0xffffffff811c37ee <__wb_calc_thresh+254>:      imul   %rbx,%rax
        0xffffffff811c37f2 <__wb_calc_thresh+258>:      shr    $0x2,%rax
        0xffffffff811c37f6 <__wb_calc_thresh+262>:      mul    %rcx
        0xffffffff811c37f9 <__wb_calc_thresh+265>:      shr    $0x2,%rdx
        0xffffffff811c37fd <__wb_calc_thresh+269>:      mov    %rdx,%r10
      
      This series uses div64_ul() instead of div_u64() if the divisor is
      unsigned long, to avoid truncation to 32-bit on 64-bit platforms.
      
      This patch (of 3):
      
      The variables 'min' and 'max' are unsigned long and do_div() truncates
      them to 32 bits, which means a value can test as non-zero yet be
      truncated to zero for the division.  Fix this issue by using div64_ul()
      instead.
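      
      The truncate-to-zero hazard in miniature (again plain C; the cast models
      do_div()'s divisor narrowing):
      
        #include <stdint.h>
        #include <stdio.h>
      
        int main(void)
        {
                unsigned long tot_bw = 0x100000000UL;   /* non-zero, but 2^32 */
                uint32_t truncated = (uint32_t)tot_bw;  /* becomes 0 */
      
                printf("tot_bw != 0: %d, truncated divisor: %u\n",
                       tot_bw != 0, truncated);
                /* A do_div()-style division here would divide by zero;
                 * div64_ul() keeps the divisor at 2^32 and stays safe. */
                return 0;
        }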
      
      Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
      Fixes: 693108a8 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
      Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, debug_pagealloc: don't rely on static keys too early · 8e57f8ac
      By Vlastimil Babka
      Commit 96a2b03f ("mm, debug_pagelloc: use static keys to enable
      debugging") has introduced a static key to reduce overhead when
      debug_pagealloc is compiled in but not enabled.  It relied on the
      assumption that jump_label_init() is called before parse_early_param()
      as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
      it is safe to enable the static key.
      
      However, it turns out multiple architectures call parse_early_param()
      earlier from their setup_arch().  x86 also calls jump_label_init() even
      earlier, so no issue was found while testing the commit, but the same is
      not true for e.g.  ppc64 and s390, where the kernel would not boot with
      debug_pagealloc=on, as found by our QA.
      
      To fix this without tricky changes to init code of multiple
      architectures, this patch partially reverts the static key conversion
      from 96a2b03f.  Init-time and non-fastpath calls (such as in arch
      code) of debug_pagealloc_enabled() will again test a simple bool
      variable.  Fastpath mm code is converted to a new
      debug_pagealloc_enabled_static() variant that relies on the static key,
      which is enabled in a well-defined point in mm_init() where it's
      guaranteed that jump_label_init() has been called, regardless of
      architecture.
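      
      A sketch of the resulting two-variant API in plain C, with a bool
      standing in for the static key; this is only the shape of the split
      described above, not the kernel implementation:
      
        #include <stdbool.h>
        #include <stdio.h>
      
        /* Plain bool: safe to test at any point during init. */
        static bool _debug_pagealloc_enabled_early;
      
        /* Stand-in for the static key, enabled later in mm_init(),
         * after jump_label_init() is guaranteed to have run. */
        static bool _debug_pagealloc_key;
      
        static bool debug_pagealloc_enabled(void)         /* init/arch code */
        { return _debug_pagealloc_enabled_early; }
      
        static bool debug_pagealloc_enabled_static(void)  /* mm fast paths */
        { return _debug_pagealloc_key; }
      
        int main(void)
        {
                _debug_pagealloc_enabled_early = true;  /* "debug_pagealloc=on" */
                /* ... later, in mm_init(): */
                _debug_pagealloc_key = _debug_pagealloc_enabled_early;
                printf("%d %d\n", debug_pagealloc_enabled(),
                       debug_pagealloc_enabled_static());
                return 0;
        }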
      
      [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
        Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
      Fixes: 96a2b03f ("mm, debug_pagelloc: use static keys to enable debugging")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: fix percpu slab vmstats flushing · 4a87e2a2
      By Roman Gushchin
      Currently slab percpu vmstats are flushed twice: during the memcg
      offlining and just before freeing the memcg structure.  Each time percpu
      counters are summed, added to the atomic counterparts and propagated up
      by the cgroup tree.
      
      The second flushing is required due to how recursive vmstats are
      implemented: counters are batched in percpu variables on a local level,
      and once a percpu value crosses some predefined threshold, it spills
      over to atomic values on the local and each ancestor level.  It means
      that without flushing, some numbers cached in percpu variables will be
      dropped on the floor each time a cgroup is destroyed.  And with uptime
      the error on upper levels might become noticeable.
      
      The first flushing aims to make counters on ancestor levels more
      precise.  Dying cgroups may remain in the dying state for a long time.
      After the kmem_cache reparenting performed during offlining, slab
      counters of the dying cgroup have no chance of being updated, because
      any slab operations will be performed on the parent level.  It means
      that the inaccuracy caused by percpu batching will not decrease up to
      the final destruction of the cgroup.  The original idea was that
      flushing slab counters during offlining should minimize the visible
      inaccuracy of slab counters on the parent level.
      
      The problem is that percpu counters are not zeroed after the first
      flushing.  So every cached percpu value is summed twice.  It creates a
      small error (up to 32 pages per cpu, but usually less) which accumulates
      on parent cgroup level.  After creating and destroying of thousands of
      child cgroups, slab counter on parent level can be way off the real
      value.
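      
      A toy model of that double-count, in plain C: summing the same
      unflushed percpu remainders twice inflates the parent total.
      
        #include <stdio.h>
      
        int main(void)
        {
                long parent = 0;
                long percpu[2] = { 5, -3 };     /* cached, below threshold */
      
                for (int i = 0; i < 2; i++)     /* flush on offlining... */
                        parent += percpu[i];    /* ...but never zeroed */
                for (int i = 0; i < 2; i++)     /* flushed again on free */
                        parent += percpu[i];
      
                printf("parent total: %ld (should be 2)\n", parent); /* 4 */
                return 0;
        }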
      
      For now, let's just stop flushing slab counters on memcg offlining.  It
      can't be done correctly without scheduling a work on each cpu: reading
      and zeroing it during css offlining can race with an asynchronous
      update, which doesn't expect values to be changed underneath.
      
      With this change, slab counters on parent level will become eventually
      consistent.  Once all dying children are gone, values are correct.  And
      if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.
      
      It's not perfect, as slabs are reparented, so any updates after the
      reparenting will happen on the parent level.  It means that if a slab
      page was allocated, a counter on the child level was bumped, then the
      page was reparented and freed, the cancellation of the positive and
      negative counter values will not happen until the child cgroup is
      released.  It makes slab counters different from others, and it might
      prompt us to implement flushing in a correct form again.  But it's also
      a question of performance: scheduling a work item on each cpu isn't
      free, and it's an open question whether the benefit of having more
      accurate counters is worth it.
      
      We might also consider flushing all counters on offlining, not only slab
      counters.
      
      So let's fix the main problem now: make the slab counters eventually
      consistent, so at least the error won't grow with uptime (or more
      precisely the number of created and destroyed cgroups).  And think about
      the accuracy of counters separately.
      
      Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
      Fixes: bee07b33 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment · 99158997
      By Kirill A. Shutemov
      Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
      enabled.  But it doesn't work well with above-47bit hint address.
      
      Normally, the kernel doesn't create userspace mappings above 47-bit,
      even if the machine allows this (such as with 5-level paging on x86-64).
      Not all user space is ready to handle wide addresses.  It's known that
      at least some JIT compilers use higher bits in pointers to encode their
      information.
      
      Userspace can ask for allocation from the full address space by
      specifying a hint address (with or without MAP_FIXED) above 47 bits.
      If the application doesn't need a particular address, but wants to
      allocate from the whole address space, it can specify -1 as a hint
      address.
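      
      The userspace side of this trick, as a small runnable example; whether
      the high hint is honored depends on hardware and kernel 5-level paging
      support, and elsewhere the kernel simply falls back below 47 bits:
      
        #include <stdio.h>
        #include <sys/mman.h>
      
        int main(void)
        {
                /* A hint above the 47-bit boundary opts this mapping in to
                 * the full address space; without such a hint the kernel
                 * keeps mappings below 47 bits. */
                void *hint = (void *)(1UL << 48);
                void *p = mmap(hint, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        return 1;
                printf("mapped at %p\n", p);
                return 0;
        }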
      
      Unfortunately, this trick breaks THP alignment in shmem/tmpfs:
      shmem_get_unmapped_area() would not try to allocate a PMD-aligned area
      if *any* hint address is specified.
      
      This can be fixed by requesting the aligned area if we failed to
      allocate at the user-specified hint address.  The request with inflated
      length will also take the user-specified hint address.  This way we will
      not lose an allocation request from the full address space.
      
      [kirill@shutemov.name: fold in a fixup]
        Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
      Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
      Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Willhalm, Thomas" <thomas.willhalm@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment · 97d3d0f9
      By Kirill A. Shutemov
      Patch series "Fix two above-47bit hint address vs.  THP bugs".
      
      The two get_unmapped_area() implementations have to be fixed to provide
      THP-friendly mappings if above-47bit hint address is specified.
      
      This patch (of 2):
      
      Filesystems use thp_get_unmapped_area() to provide THP-friendly
      mappings.  For DAX in particular.
      
      Normally, the kernel doesn't create userspace mappings above 47-bit,
      even if the machine allows this (such as with 5-level paging on x86-64).
      Not all user space is ready to handle wide addresses.  It's known that
      at least some JIT compilers use higher bits in pointers to encode their
      information.
      
      Userspace can ask for allocation from the full address space by
      specifying a hint address (with or without MAP_FIXED) above 47 bits.
      If the application doesn't need a particular address, but wants to
      allocate from the whole address space, it can specify -1 as a hint
      address.
      
      Unfortunately, this trick breaks thp_get_unmapped_area(): the function
      would not try to allocate a PMD-aligned area if *any* hint address is
      specified.
      
      Modify the routine to handle it correctly:
      
       - Try to allocate the space at the specified hint address with length
         padding required for PMD alignment.
       - If that fails, retry without length padding (but with the same hint
         address).
       - If the returned address matches the hint address, return it.
       - Otherwise, align the address as required for THP and return.
      
      The user specified hint address is passed down to get_unmapped_area() so
      above-47bit hint address will be taken into account without breaking
      alignment requirements.
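      
      A self-contained sketch of that sequence, with get_area() stubbed out;
      the names, signatures, and the stub's behavior are illustrative, not
      the kernel's:
      
        #include <stdio.h>
      
        #define PMD_SIZE (2UL * 1024 * 1024)
      
        /* Stub for the underlying get_unmapped_area() call. */
        static unsigned long get_area(unsigned long hint, unsigned long len)
        {
                (void)len;              /* pretend a nearby spot was picked */
                return hint ? hint + 4096 : 0;
        }
      
        static unsigned long thp_area(unsigned long hint, unsigned long len)
        {
                unsigned long addr = get_area(hint, len + PMD_SIZE); /* padded try */
                if (!addr)
                        addr = get_area(hint, len);   /* retry without padding */
                if (addr == hint)
                        return addr;                  /* hint honored as-is */
                return (addr + PMD_SIZE - 1) & ~(PMD_SIZE - 1); /* align for THP */
        }
      
        int main(void)
        {
                printf("%#lx\n", thp_area(1UL << 48, 8 * PMD_SIZE));
                return 0;
        }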
      
      Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
      Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Thomas Willhalm <thomas.willhalm@intel.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: don't free usage map when removing a re-added early section · 8068df3b
      By David Hildenbrand
      When we remove an early section, we don't free the usage map, as the
      usage maps of other sections are placed into the same page.  Once the
      section is removed, it is no longer an early section (especially, the
      memmap is freed).  When we re-add that section, the usage map is reused,
      however, it is no longer an early section.  When removing that section
      again, we try to kfree() a usage map that was allocated during early
      boot - bad.
      
      Let's check against PageReserved() to see if we are dealing with a
      usage map that was allocated during boot.  We could also check against
      !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
      cleaner.
      
      Can be triggered using memtrace under ppc64/powernv:
      
        $ mount -t debugfs none /sys/kernel/debug/
        $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
        $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
         ------------[ cut here ]------------
         kernel BUG at mm/slub.c:3969!
         Oops: Exception in kernel mode, sig: 5 [#1]
         LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
         Modules linked in:
         CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
         NIP kfree+0x338/0x3b0
         LR section_deactivate+0x138/0x200
         Call Trace:
           section_deactivate+0x138/0x200
           __remove_pages+0x114/0x150
           arch_remove_memory+0x3c/0x160
           try_remove_memory+0x114/0x1a0
           __remove_memory+0x20/0x40
           memtrace_enable_set+0x254/0x850
           simple_attr_write+0x138/0x160
           full_proxy_write+0x8c/0x110
           __vfs_write+0x38/0x70
           vfs_write+0x11c/0x2a0
           ksys_write+0x84/0x140
           system_call+0x5c/0x68
         ---[ end trace 4b053cbd84e0db62 ]---
      
      The first invocation will offline+remove memory blocks.  The second
      invocation will first add+online them again, in order to offline+remove
      them again (usually we are lucky and the exact same memory blocks will
      get "reallocated").
      
      Tested on powernv with boot memory: The usage map will not get freed.
      Tested on x86-64 with DIMMs: The usage map will get freed.
      
      Using Dynamic Memory under a Power DLPAR can trigger it easily.
      
      Triggering removal (I assume after previously removed+re-added) of
      memory from the HMC GUI can crash the kernel with the same call trace
      and is fixed by this patch.
      
      Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
      Fixes: 326e1b8f ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Pingfan Liu <piliu@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations · cc638f32
      By Vlastimil Babka
      THP page faults now attempt a __GFP_THISNODE allocation first, which
      should only compact existing free memory, followed by another attempt
      that can allocate from any node using reclaim/compaction effort
      specified by global defrag setting and madvise.
      
      This patch makes the following changes to the scheme:
      
       - Before the patch, the first allocation relies on a check for
         pageblock order and __GFP_IO to prevent excessive reclaim. This
         however affects also the second attempt, which is not limited to
         single node.
      
         Instead of that, reuse the existing check for costly order
         __GFP_NORETRY allocations, and make sure the first THP attempt uses
         __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
         allocations will bail out if compaction needs reclaim, while
         previously they only bailed out when compaction was deferred due to
         previous failures.
      
         This should be still acceptable within the __GFP_NORETRY semantics.
      
       - Before the patch, the second allocation attempt (on all nodes) was
         passing __GFP_NORETRY. This is redundant as the check for pageblock
         order (discussed above) was stronger. It's also contrary to
         madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
         requested.
      
         After this patch, the second attempt doesn't pass __GFP_THISNODE nor
         __GFP_NORETRY.
      
      To sum up, THP page faults now try the following attempts:
      
      1. local node only THP allocation with no reclaim, just compaction.
      2. for madvised VMAs, or always when synchronous compaction is enabled:
         THP allocation from any node, with effort determined by the global
         defrag setting and VMA madvise.
      3. fallback to base pages on any node.
      
      Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
      Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 7 January 2020 (1 commit)
    • arm64: Revert support for execute-only user mappings · 24cecc37
      By Catalin Marinas
      The ARMv8 64-bit architecture supports execute-only user permissions by
      clearing the PTE_USER and PTE_UXN bits, practically making it a mostly
      privileged mapping, but one from which user code running at EL0 can
      still execute.
      
      The downside, however, is that the kernel at EL1 inadvertently reading
      such a mapping would not trip over the PAN (privileged access never)
      protection.
      
      Revert the relevant bits from commit cab15ce6 ("arm64: Introduce
      execute-only page access permissions") so that PROT_EXEC implies
      PROT_READ (and therefore PTE_USER) until the architecture gains proper
      support for execute-only user mappings.
      
      Fixes: cab15ce6 ("arm64: Introduce execute-only page access permissions")
      Cc: <stable@vger.kernel.org> # 4.9.x-
      Acked-by: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 5 January 2020 (6 commits)
    • mm/hugetlb: defer freeing of huge pages if in non-task context · c77c0a8a
      By Waiman Long
      The following lockdep splat was observed when a certain hugetlbfs test
      was run:
      
        ================================
        WARNING: inconsistent lock state
        4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
        --------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
        ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
        {SOFTIRQ-ON-W} state was registered at:
          lock_acquire+0x14f/0x3b0
          _raw_spin_lock+0x30/0x70
          __nr_hugepages_store_common+0x11b/0xb30
          hugetlb_sysctl_handler_common+0x209/0x2d0
          proc_sys_call_handler+0x37f/0x450
          vfs_write+0x157/0x460
          ksys_write+0xb8/0x170
          do_syscall_64+0xa5/0x4d0
          entry_SYSCALL_64_after_hwframe+0x6a/0xdf
        irq event stamp: 691296
        hardirqs last  enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
        hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
        softirqs last  enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
        softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(hugetlb_lock);
          <Interrupt>
            lock(hugetlb_lock);
      
         *** DEADLOCK ***
            :
        Call Trace:
         <IRQ>
         __lock_acquire+0x146b/0x48c0
         lock_acquire+0x14f/0x3b0
         _raw_spin_lock+0x30/0x70
         free_huge_page+0x36f/0xaa0
         bio_check_pages_dirty+0x2fc/0x5c0
         clone_endio+0x17f/0x670 [dm_mod]
         blk_update_request+0x276/0xe50
         scsi_end_request+0x7b/0x6a0
         scsi_io_completion+0x1c6/0x1570
         blk_done_softirq+0x22e/0x350
         __do_softirq+0x23d/0xad8
         irq_exit+0x23e/0x2b0
         do_IRQ+0x11a/0x200
         common_interrupt+0xf/0xf
         </IRQ>
      
      Both the hugetlb_lock and the subpool lock can be acquired in
      free_huge_page().  One way to solve the problem is to make both locks
      irq-safe.  However, Mike Kravetz had learned that the hugetlb_lock is
      held for a linear scan of ALL hugetlb pages during a cgroup reparenting
      operation.  So it is just too long to have irqs disabled unless we can
      break hugetlb_lock down into finer-grained locks with shorter lock hold
      times.
      
      Another alternative is to defer the freeing to a workqueue job.  This
      patch implements the deferred freeing by adding a free_hpage_workfn()
      work function to do the actual freeing.  The free_huge_page() call in a
      non-task context saves the page to be freed in the hpage_freelist linked
      list in a lockless manner using the llist APIs.
      
      The generic workqueue is used to process the work, but a dedicated
      workqueue can be used instead if it is desirable to have the huge page
      freed ASAP.
      
      Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of
      llist APIs, which simplify the code.
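      
      The lockless hand-off described above, modeled in userspace with C11
      atomics; the kernel uses the llist API and a workqueue, so this is only
      the shape of the pattern:
      
        #include <stdatomic.h>
        #include <stdio.h>
      
        struct hpage { struct hpage *next; int id; };
        static _Atomic(struct hpage *) hpage_freelist;
      
        /* free_huge_page() side: safe from any context, no locks taken. */
        static void defer_free(struct hpage *p)
        {
                struct hpage *first = atomic_load(&hpage_freelist);
                do {
                        p->next = first;
                } while (!atomic_compare_exchange_weak(&hpage_freelist,
                                                       &first, p));
        }
      
        /* free_hpage_workfn() side: runs later in task context. */
        static void free_hpage_workfn(void)
        {
                struct hpage *node = atomic_exchange(&hpage_freelist, NULL);
                for (; node; node = node->next)
                        printf("freeing huge page %d\n", node->id);
        }
      
        int main(void)
        {
                struct hpage a = { .id = 1 }, b = { .id = 2 };
                defer_free(&a);
                defer_free(&b);
                free_hpage_workfn();
                return 0;
        }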
      
      Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: fix memory leak in __gup_benchmark_ioctl · a7c46c0c
      By Navid Emamdoost
      In the implementation of __gup_benchmark_ioctl(), the allocated pages
      should be released before returning in case of an invalid cmd.  Release
      the pages via kvfree().
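      
      The leak pattern and its fix in miniature (userspace C; names loosely
      follow the commit, not the kernel source):
      
        #include <errno.h>
        #include <stdlib.h>
      
        static int gup_benchmark_demo(int cmd)
        {
                void **pages = calloc(64, sizeof(*pages));
                if (!pages)
                        return -ENOMEM;
      
                if (cmd < 1 || cmd > 3) {
                        free(pages);    /* the previously missing release */
                        return -EINVAL;
                }
                /* ... pin pages and time the pinning here ... */
                free(pages);
                return 0;
        }
      
        int main(void)
        {
                return gup_benchmark_demo(0) == -EINVAL ? 0 : 1;
        }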
      
      [akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1]
      Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com
      Fixes: 714a3a1e ("mm/gup_benchmark.c: add additional pinning methods")
      Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/oom: fix pgtables units mismatch in Killed process message · 941f762b
      By Ilya Dryomov
      pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
      As everything else is printed in kB, I chose to fix the value rather than
      the string.
      
      Before:
      
      [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
      ...
      [   1878]  1000  1878   217253   151144  1269760        0             0 python
      ...
      Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0
      
      After:
      
      [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
      ...
      [   1436]  1000  1436   217253   151890  1294336        0             0 python
      ...
      Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
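      
      The unit fix in miniature, using the pgtables_bytes value from the
      table above (the kernel function is only echoed in the comment):
      
        #include <stdio.h>
      
        int main(void)
        {
                /* mm_pgtables_bytes() returns bytes; the report prints kB. */
                unsigned long pgtables_bytes = 1294336;  /* from the table above */
                printf("pgtables:%lukB\n", pgtables_bytes / 1024);  /* 1264kB */
                return 0;
        }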
      
      Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
      Fixes: 70cb6d26 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Edward Chron <echron@arista.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move_pages: return valid node id in status if the page is already on the target node · e0153fc2
      By Yang Shi
      Felix Abecassis reports that move_pages() returns random status values
      if the pages are already on the target node, as demonstrated by the
      test program below:
      
        int main(void)
        {
      	const long node_id = 1;
      	const long page_size = sysconf(_SC_PAGESIZE);
      	const int64_t num_pages = 8;
      
      	unsigned long nodemask =  1 << node_id;
      	long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
      	if (ret < 0)
      		return (EXIT_FAILURE);
      
      	void **pages = malloc(sizeof(void*) * num_pages);
      	for (int i = 0; i < num_pages; ++i) {
      		pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
      				MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
      				-1, 0);
      		if (pages[i] == MAP_FAILED)
      			return (EXIT_FAILURE);
      	}
      
      	ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
      	if (ret < 0)
      		return (EXIT_FAILURE);
      
      	int *nodes = malloc(sizeof(int) * num_pages);
      	int *status = malloc(sizeof(int) * num_pages);
      	for (int i = 0; i < num_pages; ++i) {
      		nodes[i] = node_id;
      		status[i] = 0xd0; /* simulate garbage values */
      	}
      
      	ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
      	printf("move_pages: %ld\n", ret);
      	for (int i = 0; i < num_pages; ++i)
      		printf("status[%d] = %d\n", i, status[i]);
        }
      
      Then running the program would return nonsense status values:
      
        $ ./move_pages_bug
        move_pages: 0
        status[0] = 208
        status[1] = 208
        status[2] = 208
        status[3] = 208
        status[4] = 208
        status[5] = 208
        status[6] = 208
        status[7] = 208
      
      This is because the status is not set if the page is already on the
      target node, but move_pages() should return a valid status as long as
      it succeeds.  The valid status may be an errno or a node id.
      
      We can't simply initialize the status array to zero since the pages may
      not be on node 0.  Fix it by updating status with the node id which the
      page is already on.
      
      Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: a49bd4d7 ("mm, numa: rework do_pages_move")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Felix Abecassis <fabecassis@nvidia.com>
      Tested-by: Felix Abecassis <fabecassis@nvidia.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc.c: fix the migrated zspage statistics. · ac8f05da
      By Chanho Min
      When a zspage is migrated to another zone, the zone page state should be
      updated as well; otherwise NR_ZSPAGES for each zone shows wrong counts,
      including in /proc/zoneinfo in practice.
      
      Link: http://lkml.kernel.org/r/1575434841-48009-1-git-send-email-chanho.min@lge.com
      Fixes: 91537fee ("mm: add NR_ZSMALLOC to vmstat")
      Signed-off-by: Chanho Min <chanho.min@lge.com>
      Signed-off-by: Jinsuk Choi <jjinsuk.choi@lge.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>        [4.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: shrink zones when offlining memory · feee6b29
      By David Hildenbrand
      We currently try to shrink a single zone when removing memory.  We use
      the zone of the first page of the memory we are removing.  If that
      memmap was never initialized (e.g., memory was never onlined), we will
      read garbage and can trigger kernel BUGs (due to a stale pointer):
      
          BUG: unable to handle page fault for address: 000000000000353d
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:clear_zone_contiguous+0x5/0x10
          Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
          RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
          RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
          RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
          R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
          FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           __remove_pages+0x4b/0x640
           arch_remove_memory+0x63/0x8d
           try_remove_memory+0xdb/0x130
           __remove_memory+0xa/0x11
           acpi_memory_device_remove+0x70/0x100
           acpi_bus_trim+0x55/0x90
           acpi_device_hotplug+0x227/0x3a0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x221/0x550
           worker_thread+0x50/0x3b0
           kthread+0x105/0x140
           ret_from_fork+0x3a/0x50
          Modules linked in:
          CR2: 000000000000353d
      
      Instead, shrink the zones when offlining memory or when onlining failed.
      Introduce and use remove_pfn_range_from_zone() for that.  We now
      properly shrink the zones, even if we have DIMMs whereby
      
       - Some memory blocks fall into no zone (never onlined)
      
       - Some memory blocks fall into multiple zones (offlined+re-onlined)
      
       - Multiple memory blocks that fall into different zones
      
      Drop the zone parameter (with a potential dubious value) from
      __remove_pages() and __remove_section().
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 31 December 2019 (1 commit)
    • x86/kasan: Print original address on #GP · 2f004eea
      By Jann Horn
      Make #GP exceptions caused by out-of-bounds KASAN shadow accesses easier
      to understand by computing the address of the original access and
      printing that. More details are in the comments in the patch.
      
      This turns an error like this:
      
        kasan: CONFIG_KASAN_INLINE enabled
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault, probably for non-canonical address
            0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI
      
      into this:
      
        general protection fault, probably for non-canonical address
            0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI
        KASAN: maybe wild-memory-access in range
            [0x00badbeefbadbee8-0x00badbeefbadbeef]
      
      The hook is placed in architecture-independent code, but is currently
      only wired up to the X86 exception handler because I'm not sufficiently
      familiar with the address space layout and exception handling mechanisms
      on other architectures.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: kasan-dev@googlegroups.com
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191218231150.12139-4-jannh@google.com
  10. 18 December 2019 (5 commits)
    • mm: vmscan: protect shrinker idr replace with CONFIG_MEMCG · 42a9a53b
      By Yang Shi
      Since commit 0a432dcb ("mm: shrinker: make shrinker not depend on
      memcg kmem"), shrinkers' idr is protected by CONFIG_MEMCG instead of
      CONFIG_MEMCG_KMEM, so it makes no sense to protect shrinker idr replace
      with CONFIG_MEMCG_KMEM.
      
      And in the CONFIG_MEMCG && CONFIG_SLOB case, shrinker_idr contains only
      one shrinker, the deferred_split_shrinker.  But it is never actually
      called, since idr_replace() is never compiled due to the wrong #ifdef.
      The deferred_split_shrinker stays in a half-registered state the whole
      time, and it's never called for subordinate mem cgroups.
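      
      The mismatched guards, reduced to a compile-time toy; the config names
      are real, everything else is illustrative:
      
        #include <stdio.h>
      
        #define CONFIG_MEMCG 1
        /* CONFIG_MEMCG_KMEM deliberately unset, as in a CONFIG_SLOB build */
      
        int main(void)
        {
        #ifdef CONFIG_MEMCG
                printf("shrinker_idr: id allocated (half-registered)\n");
        #endif
        #ifdef CONFIG_MEMCG_KMEM        /* the wrong guard in the old code */
                printf("idr_replace(): registration completed\n");
        #else
                printf("idr_replace() compiled out: shrinker never runs\n");
        #endif
                return 0;
        }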
      
      Link: http://lkml.kernel.org/r/1575486978-45249-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: 0a432dcb ("mm: shrinker: make shrinker not depend on memcg kmem")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[5.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan: don't assume percpu shadow allocations will succeed · 253a496d
      By Daniel Axtens
      syzkaller and the fault injector showed that I was wrong to assume that
      we could ignore percpu shadow allocation failures.
      
      Handle failures properly.  Merge all the allocated areas back into the
      free list and release the shadow, then clean up and return NULL.  The
      shadow is released unconditionally, which relies upon the fact that the
      release function is able to tolerate pages not being present.
      
      Also clean up shadows in the recovery path - currently they are not
      released, which leaks a bit of memory.
      
      Link: http://lkml.kernel.org/r/20191205140407.1874-3-dja@axtens.net
      Fixes: 3c5c3cfb ("kasan: support backing vmalloc space with real shadow memory")
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Reported-by: syzbot+82e323920b78d54aaed5@syzkaller.appspotmail.com
      Reported-by: syzbot+59b7daa4315e07a994f1@syzkaller.appspotmail.com
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan: use apply_to_existing_page_range() for releasing vmalloc shadow · e218f1ca
      By Daniel Axtens
      kasan_release_vmalloc uses apply_to_page_range to release vmalloc
      shadow.  Unfortunately, apply_to_page_range can allocate memory to fill
      in page table entries, which is not what we want.
      
      Also, kasan_release_vmalloc is called under free_vmap_area_lock, so if
      apply_to_page_range does allocate memory, we get a sleep-in-atomic bug:
      
      	BUG: sleeping function called from invalid context at mm/page_alloc.c:4681
      	in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 15087, name:
      
      	Call Trace:
      	 __dump_stack lib/dump_stack.c:77 [inline]
      	 dump_stack+0x199/0x216 lib/dump_stack.c:118
      	 ___might_sleep.cold.97+0x1f5/0x238 kernel/sched/core.c:6800
      	 __might_sleep+0x95/0x190 kernel/sched/core.c:6753
      	 prepare_alloc_pages mm/page_alloc.c:4681 [inline]
      	 __alloc_pages_nodemask+0x3cd/0x890 mm/page_alloc.c:4730
      	 alloc_pages_current+0x10c/0x210 mm/mempolicy.c:2211
      	 alloc_pages include/linux/gfp.h:532 [inline]
      	 __get_free_pages+0xc/0x40 mm/page_alloc.c:4786
      	 __pte_alloc_one_kernel include/asm-generic/pgalloc.h:21 [inline]
      	 pte_alloc_one_kernel include/asm-generic/pgalloc.h:33 [inline]
      	 __pte_alloc_kernel+0x1d/0x200 mm/memory.c:459
      	 apply_to_pte_range mm/memory.c:2031 [inline]
      	 apply_to_pmd_range mm/memory.c:2068 [inline]
      	 apply_to_pud_range mm/memory.c:2088 [inline]
      	 apply_to_p4d_range mm/memory.c:2108 [inline]
      	 apply_to_page_range+0x77d/0xa00 mm/memory.c:2133
      	 kasan_release_vmalloc+0xa7/0xc0 mm/kasan/common.c:970
      	 __purge_vmap_area_lazy+0xcbb/0x1f30 mm/vmalloc.c:1313
      	 try_purge_vmap_area_lazy mm/vmalloc.c:1332 [inline]
      	 free_vmap_area_noflush+0x2ca/0x390 mm/vmalloc.c:1368
      	 free_unmap_vmap_area mm/vmalloc.c:1381 [inline]
      	 remove_vm_area+0x1cc/0x230 mm/vmalloc.c:2209
      	 vm_remove_mappings mm/vmalloc.c:2236 [inline]
      	 __vunmap+0x223/0xa20 mm/vmalloc.c:2299
      	 __vfree+0x3f/0xd0 mm/vmalloc.c:2356
      	 __vmalloc_area_node mm/vmalloc.c:2507 [inline]
      	 __vmalloc_node_range+0x5d5/0x810 mm/vmalloc.c:2547
      	 __vmalloc_node mm/vmalloc.c:2607 [inline]
      	 __vmalloc_node_flags mm/vmalloc.c:2621 [inline]
      	 vzalloc+0x6f/0x80 mm/vmalloc.c:2666
      	 alloc_one_pg_vec_page net/packet/af_packet.c:4233 [inline]
      	 alloc_pg_vec net/packet/af_packet.c:4258 [inline]
      	 packet_set_ring+0xbc0/0x1b50 net/packet/af_packet.c:4342
      	 packet_setsockopt+0xed7/0x2d90 net/packet/af_packet.c:3695
      	 __sys_setsockopt+0x29b/0x4d0 net/socket.c:2117
      	 __do_sys_setsockopt net/socket.c:2133 [inline]
      	 __se_sys_setsockopt net/socket.c:2130 [inline]
      	 __x64_sys_setsockopt+0xbe/0x150 net/socket.c:2130
      	 do_syscall_64+0xfa/0x780 arch/x86/entry/common.c:294
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Switch to using the apply_to_existing_page_range() helper instead, which
      won't allocate memory.
      
      [akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/]
      Link: http://lkml.kernel.org/r/20191205140407.1874-2-dja@axtens.net
      Fixes: 3c5c3cfb ("kasan: support backing vmalloc space with real shadow memory")
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory.c: add apply_to_existing_page_range() helper · be1db475
      By Daniel Axtens
      apply_to_page_range() takes an address range, and if any parts of it are
      not covered by the existing page table hierarchy, it allocates memory to
      fill them in.
      
      In some use cases, this is not what we want - we want to be able to
      operate exclusively on PTEs that are already in the tables.
      
      Add apply_to_existing_page_range() for this.  Adjust the walker
      functions for apply_to_page_range to take 'create', which switches them
      between the old and new modes.
      
      This will be used in KASAN vmalloc.
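      
      The 'create' switch, modeled on a flat array of slots instead of page
      tables (illustrative only; the real walkers operate on the
      pgd/p4d/pud/pmd levels):
      
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>
      
        /* Visit n slots; in create mode allocate missing entries first,
         * in existing-only mode just skip them. */
        static int apply_range(int **slot, int n, bool create,
                               void (*fn)(int *))
        {
                for (int i = 0; i < n; i++) {
                        if (!slot[i]) {
                                if (!create)
                                        continue; /* existing-only mode */
                                slot[i] = calloc(1, sizeof(int));
                                if (!slot[i])
                                        return -1;
                        }
                        fn(slot[i]);
                }
                return 0;
        }
      
        static void set_one(int *p) { *p = 1; }
      
        int main(void)
        {
                int a = 0;
                int *slots[2] = { &a, NULL };
                apply_range(slots, 2, false, set_one); /* leaves the hole alone */
                printf("a=%d slot1=%p\n", a, (void *)slots[1]);
                return 0;
        }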
      
      [akpm@linux-foundation.org: reduce code duplication]
      [akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/]
      [akpm@linux-foundation.org: initialize __apply_to_page_range::err]
      Link: http://lkml.kernel.org/r/20191205140407.1874-1-dja@axtens.net
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan: fix crashes on access to memory mapped by vm_map_ram() · d98c9e83
      By Andrey Ryabinin
      With CONFIG_KASAN_VMALLOC=y any use of memory obtained via vm_map_ram()
      will crash because there is no shadow backing that memory.
      
      Instead of sprinkling additional kasan_populate_vmalloc() calls all over
      the vmalloc code, move it into alloc_vmap_area(). This will fix
      vm_map_ram() and simplify the code a bit.
      
      [aryabinin@virtuozzo.com: v2]
        Link: http://lkml.kernel.org/r/20191205095942.1761-1-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/20191204204534.32202-1-aryabinin@virtuozzo.com
      Fixes: 3c5c3cfb ("kasan: support backing vmalloc space with real shadow memory")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 17 December 2019 (1 commit)
    • mm: hugetlb controller for cgroups v2 · faced7e0
      By Giuseppe Scrivano
      In the effort of supporting cgroups v2 in Kubernetes, I stumbled on
      the lack of the hugetlb controller.
      
      When the controller is enabled, it exposes four new files for each
      hugetlb size on non-root cgroups:
      
      - hugetlb.<hugepagesize>.current
      - hugetlb.<hugepagesize>.max
      - hugetlb.<hugepagesize>.events
      - hugetlb.<hugepagesize>.events.local
      
      The differences with the legacy hierarchy are in the file names and
      using the value "max" instead of "-1" to disable a limit.
      
      The file .limit_in_bytes is renamed to .max.
      
      The file .usage_in_bytes is renamed to .current.
      
      .failcnt is not provided as a single file anymore, but its value can
      be read through the new flat-keyed files .events and .events.local,
      through the "max" key.
      Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  12. 10 December 2019 (1 commit)
    • mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions · 186525bd
      By Ingo Molnar
      - Untangle the somewhat incestuous way in which VMALLOC_START is used all
        across the kernel, while being defined, on x86, deep inside one of the
        lowest level page table headers.  It doesn't help that vmalloc.h only
        includes a single asm header:
      
           #include <asm/page.h>           /* pgprot_t */
      
        So there was no existing cross-arch way to decouple address layout
        definitions from page.h details. I used this:
      
         #ifndef VMALLOC_START
         # include <asm/vmalloc.h>
         #endif
      
        This way every architecture that wants to simplify page.h can do so.
      
      - Also on x86 we had a couple of LDT related inline functions that used
        the late-stage address space layout positions - but these could be
        uninlined without real trouble - the end result is cleaner this way as
        well.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 8 December 2019 (1 commit)
  14. 5 December 2019 (5 commits)
    • M
      mm: remove __ARCH_HAS_4LEVEL_HACK and include/asm-generic/4level-fixup.h · f949286c
      Mike Rapoport authored
      There are no architectures that use include/asm-generic/4level-fixup.h,
      so it can be removed along with the __ARCH_HAS_4LEVEL_HACK define.
      
      Link: http://lkml.kernel.org/r/1572938135-31886-14-git-send-email-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Anatoly Pugachev <matorola@gmail.com>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Peter Rosin <peda@axentia.se>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Sam Creasey <sammy@sammy.net>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f949286c
    • Y
      mm/memory.c: replace is_zero_pfn with is_huge_zero_pmd for thp · 3cde287b
      Yu Zhao authored
      For a hugely mapped thp, we should use is_huge_zero_pmd() to check
      whether it maps the zero page.
      
      We only fill ptes with my_zero_pfn() when we split a zero thp pmd, and
      vm_normal_page_pmd() never sees such split ptes -- pmd_trans_huge_lock()
      makes sure of that -- so the is_zero_pfn() check there cannot match.
      
      This is a trivial fix for /proc/pid/numa_maps, and AFAIK nobody has
      complained about it.
      
      Gerald Schaefer asked:
      : Maybe the description could also mention the symptom of this bug?
      : I would assume that it affects anon/dirty accounting in gather_pte_stats(),
      : for huge mappings, if zero page mappings are not correctly recognized.
      
      I came across this while I was looking at the code, so I'm not aware of
      any symptom.
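
      A tiny standalone model of why the pte-level check can never match
      here (the pfn values are made up): the 4K zero page and the huge zero
      page are distinct pages with distinct pfns, so a pmd-level test has to
      compare against the huge zero pfn.

        #include <stdbool.h>
        #include <stdio.h>

        static const unsigned long zero_pfn = 100;      /* made-up 4K zero page pfn */
        static const unsigned long huge_zero_pfn = 512; /* made-up huge zero page pfn */

        static bool is_zero_pfn(unsigned long pfn)      { return pfn == zero_pfn; }
        static bool is_huge_zero_pmd(unsigned long pfn) { return pfn == huge_zero_pfn; }

        int main(void)
        {
                unsigned long pmd_pfn = huge_zero_pfn;  /* pfn of a huge zero pmd */

                printf("is_zero_pfn:      %d\n", is_zero_pfn(pmd_pfn));      /* 0: misses */
                printf("is_huge_zero_pmd: %d\n", is_huge_zero_pmd(pmd_pfn)); /* 1: matches */
                return 0;
        }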
      
      Link: http://lkml.kernel.org/r/20191108192629.201556-1-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cde287b
    • K
      mm/memcontrol: use vmstat names for printing statistics · ebc5d83d
      Konstantin Khlebnikov authored
      Use common names from the vmstat array when possible.  This makes
      little difference in code size for now, but should help keep the
      interfaces consistent.
      
        add/remove: 0/2 grow/shrink: 2/0 up/down: 70/-72 (-2)
        Function                                     old     new   delta
        memory_stat_format                           984    1050     +66
        memcg_stat_show                              957     961      +4
        memcg1_event_names                            32       -     -32
        mem_cgroup_lru_names                          40       -     -40
        Total: Before=14485337, After=14485335, chg -0.00%
      
      Link: http://lkml.kernel.org/r/157113012508.453.80391533767219371.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebc5d83d
    • K
      mm/vmstat: add helpers to get vmstat item names for each enum type · 9d7ea9a2
      Konstantin Khlebnikov authored
      Statistics in vmstat are combined from counters with different
      structures, but their names are merged into one array.
      
      This patch adds trivial helpers to get name for each item:
      
        const char *zone_stat_name(enum zone_stat_item item);
        const char *numa_stat_name(enum numa_stat_item item);
        const char *node_stat_name(enum node_stat_item item);
        const char *writeback_stat_name(enum writeback_stat_item item);
        const char *vm_event_name(enum vm_event_item item);
      
      Names for enum writeback_stat_item are folded into the middle of
      vmstat_text, so this patch moves the declaration into a header to
      calculate the offsets of the following items.
      
      This patch also reuses a piece of the node stat names for the lru list
      names:
      
        const char *lru_list_name(enum lru_list lru);
      
      This returns common lru list names: "inactive_anon", "active_anon",
      "inactive_file", "active_file", "unevictable".
      
      [khlebnikov@yandex-team.ru: do not use size of vmstat_text as count of /proc/vmstat items]
        Link: http://lkml.kernel.org/r/157152151769.4139.15423465513138349343.stgit@buzz
        Link: https://lore.kernel.org/linux-mm/cd1c42ae-281f-c8a8-70ac-1d01d417b2e1@infradead.org/T/#u
      Link: http://lkml.kernel.org/r/157113012325.453.562783073839432766.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d7ea9a2
    • R
      mm: memcg/slab: wait for !root kmem_cache refcnt killing on root kmem_cache destruction · a264df74
      Roman Gushchin authored
      Christian reported a warning like the following obtained during running
      some KVM-related tests on s390:
      
          WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
          Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
          CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
          Hardware name: IBM 2964 NC9 712 (LPAR)
          Workqueue: events sysfs_slab_remove_workfn
          Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
                     R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
          Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
                     0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
                     0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
                     0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
          Krnl Code: 0000001529746842: f0a0000407fe        srp        4(11,%r0),2046,0
                     0000001529746848: 47000700            bc         0,1792
                    #000000152974684c: a7f40001            brc        15,152974684e
                    >0000001529746850: a7f4fff2            brc        15,1529746834
                     0000001529746854: 0707                bcr        0,%r7
                     0000001529746856: 0707                bcr        0,%r7
                     0000001529746858: eb8ff0580024        stmg       %r8,%r15,88(%r15)
                     000000152974685e: a738ffff            lhi        %r3,-1
          Call Trace:
          ([<000003e009263d00>] 0x3e009263d00)
           [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
           [<0000001529b04882>] kobject_put+0xaa/0xe8
           [<000000152918cf28>] process_one_work+0x1e8/0x428
           [<000000152918d1b0>] worker_thread+0x48/0x460
           [<00000015291942c6>] kthread+0x126/0x160
           [<0000001529b22344>] ret_from_fork+0x28/0x30
           [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
          Last Breaking-Event-Address:
           [<000000152974684c>] percpu_ref_exit+0x4c/0x58
          ---[ end trace b035e7da5788eb09 ]---
      
      The problem occurs because kmem_cache_destroy() is called immediately
      after the deletion of a memcg, so it races with the memcg kmem_cache
      deactivation.
      
      flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is
      supposed to guarantee that all deactivation processes are finished, but
      fails to do so.  It waits for an rcu grace period, after which all
      child kmem_caches should be deactivated.  During the deactivation
      percpu_ref_kill() is called for non-root kmem_cache refcounters, but it
      requires yet another rcu grace period to finish the transition to the
      atomic (dead) state.
      
      So in the rare case when not all child kmem_caches have been destroyed
      at the moment when the root kmem_cache is about to be gone, we need to
      wait another rcu grace period before destroying the root kmem_cache.
      
      This issue can be triggered only with dynamically created kmem_caches
      which are used with memcg accounting.  In this case per-memcg child
      kmem_caches are created.  They are deactivated from the cgroup removal
      path.  If the destruction of the root kmem_cache races with the
      removal of the cgroup (both are quite complicated multi-stage
      processes), the described issue can occur.  The only known way to
      trigger it in real life is to unload a kernel module which creates a
      dedicated kmem_cache, used from different memory cgroups with the
      GFP_ACCOUNT flag.  If the unloading happens immediately after calling
      rmdir on the corresponding cgroup, there is some chance of triggering
      the issue.
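
      A hedged, kernel-style sketch of the shape of such a wait (not the
      verbatim patch; the children list and the use of rcu_barrier() here
      are assumptions about that era's slab memcg layout):

        static void flush_memcg_workqueue(struct kmem_cache *s)
        {
                /* ... existing flushing of the deactivation work ... */

                /*
                 * percpu_ref_kill() on a child refcounter needs one more
                 * rcu grace period to reach the atomic (dead) state, so
                 * don't let the root kmem_cache go away under it.
                 */
                if (!list_empty(&s->memcg_params.children))
                        rcu_barrier();
        }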
      
      Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com
      Fixes: f0a3a24b ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a264df74