1. 16 Dec, 2020 30 commits
  2. 12 Dec, 2020 3 commits
    • mm/hugetlb: clear compound_nr before freeing gigantic pages · ba9c1201
      Gerald Schaefer committed
      Commit 1378a5ee ("mm: store compound_nr as well as compound_order")
      added compound_nr counter to first tail struct page, overlaying with
      page->mapping.  The overlay itself is fine, but while freeing gigantic
      hugepages via free_contig_range(), a "bad page" check will trigger for
      non-NULL page->mapping on the first tail page:
      
        BUG: Bad page state in process bash  pfn:380001
        page:00000000c35f0856 refcount:0 mapcount:0 mapping:00000000126b68aa index:0x0 pfn:0x380001
        aops:0x0
        flags: 0x3ffff00000000000()
        raw: 3ffff00000000000 0000000000000100 0000000000000122 0000000100000000
        raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
        page dumped because: non-NULL mapping
        Modules linked in:
        CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 #1
        Hardware name: IBM 3906 M03 703 (LPAR)
        Call Trace:
          show_stack+0x6e/0xe8
          dump_stack+0x90/0xc8
          bad_page+0xd6/0x130
          free_pcppages_bulk+0x26a/0x800
          free_unref_page+0x6e/0x90
          free_contig_range+0x94/0xe8
          update_and_free_page+0x1c4/0x2c8
          free_pool_huge_page+0x11e/0x138
          set_max_huge_pages+0x228/0x300
          nr_hugepages_store_common+0xb8/0x130
          kernfs_fop_write+0xd2/0x218
          vfs_write+0xb0/0x2b8
          ksys_write+0xac/0xe0
          system_call+0xe6/0x288
        Disabling lock debugging due to kernel taint
      
      This is because only the compound_order is cleared in
      destroy_compound_gigantic_page(), and compound_nr is set to
      1U << order == 1 for order 0 in set_compound_order(page, 0).
      
      Fix this by explicitly clearing compound_nr for first tail page after
      calling set_compound_order(page, 0).
      
      Link: https://lkml.kernel.org/r/20201208182813.66391-2-gerald.schaefer@linux.ibm.com
      Fixes: 1378a5ee ("mm: store compound_nr as well as compound_order")
      Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: <stable@vger.kernel.org>	[5.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
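
      A minimal user-space sketch of the overlay problem fixed by the commit
      above, not the kernel code.  The toy_ struct and helpers are hypothetical
      stand-ins: the union models compound_nr sharing storage with
      page->mapping on the first tail page, and toy_destroy_gigantic() applies
      the fix the message describes (reset the order, then clear compound_nr).
      It assumes a platform where a NULL pointer is all zero bits.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the first tail struct page: compound_nr shares
 * storage with mapping, as the commit message describes. */
struct toy_tail_page {
    unsigned char compound_order;
    union {
        void *mapping;
        unsigned int compound_nr;
    };
};

/* Mirrors the behaviour described above: nr is always 1U << order,
 * so order 0 still stores 1 into the overlaid field. */
static void toy_set_compound_order(struct toy_tail_page *tail, unsigned int order)
{
    assert(tail != NULL);
    tail->compound_order = (unsigned char)order;
    tail->compound_nr = 1U << order;
}

/* Models the "bad page" check: mapping must be NULL when the page is freed. */
static int toy_mapping_is_clean(const struct toy_tail_page *tail)
{
    return tail->mapping == NULL;
}

/* Teardown with the fix applied: reset the order, then clear compound_nr. */
static void toy_destroy_gigantic(struct toy_tail_page *tail)
{
    toy_set_compound_order(tail, 0);
    tail->compound_nr = 0;
}
```

      On a zeroed tail page, toy_set_compound_order(&tail, 0) alone leaves the
      overlaid mapping non-NULL, while toy_destroy_gigantic(&tail) leaves it
      NULL again.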
    • kasan: fix object remaining in offline per-cpu quarantine · 6c82d45c
      Kuan-Ying Lee committed
      We hit this issue in our internal test.  With generic KASAN enabled, a
      kfree()'d object is first put into the per-cpu quarantine.  If the cpu
      goes offline, the object remains in that per-cpu quarantine.  If
      kmem_cache_destroy() is called at that point, slub reports an
      "Objects remaining" error.
      
        =============================================================================
        BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        Disabling lock debugging due to kernel taint
        INFO: Slab 0x(____ptrval____) objects=34 used=1 fp=0x(____ptrval____) flags=0x2ffff00000010200
        CPU: 3 PID: 176 Comm: cat Tainted: G    B             5.10.0-rc1-00007-g4525c878-dirty #10
        Hardware name: linux,dummy-virt (DT)
        Call trace:
           dump_backtrace+0x0/0x2b0
           show_stack+0x18/0x68
           dump_stack+0xfc/0x168
           slab_err+0xac/0xd4
           __kmem_cache_shutdown+0x1e4/0x3c8
           kmem_cache_destroy+0x68/0x130
           test_version_show+0x84/0xf0
           module_attr_show+0x40/0x60
           sysfs_kf_seq_show+0x128/0x1c0
           kernfs_seq_show+0xa0/0xb8
           seq_read+0x1f0/0x7e8
           kernfs_fop_read+0x70/0x338
           vfs_read+0xe4/0x250
           ksys_read+0xc8/0x180
           __arm64_sys_read+0x44/0x58
           el0_svc_common.constprop.0+0xac/0x228
           do_el0_svc+0x38/0xa0
           el0_sync_handler+0x170/0x178
           el0_sync+0x174/0x180
        INFO: Object 0x(____ptrval____) @offset=15848
        INFO: Allocated in test_version_show+0x98/0xf0 age=8188 cpu=6 pid=172
           stack_trace_save+0x9c/0xd0
           set_track+0x64/0xf0
           alloc_debug_processing+0x104/0x1a0
           ___slab_alloc+0x628/0x648
           __slab_alloc.isra.0+0x2c/0x58
           kmem_cache_alloc+0x560/0x588
           test_version_show+0x98/0xf0
           module_attr_show+0x40/0x60
           sysfs_kf_seq_show+0x128/0x1c0
           kernfs_seq_show+0xa0/0xb8
           seq_read+0x1f0/0x7e8
           kernfs_fop_read+0x70/0x338
           vfs_read+0xe4/0x250
           ksys_read+0xc8/0x180
           __arm64_sys_read+0x44/0x58
           el0_svc_common.constprop.0+0xac/0x228
        kmem_cache_destroy test_module_slab: Slab cache still has objects
      
      Register a cpu hotplug callback that removes all objects from the
      per-cpu quarantine when the cpu is going offline.  Set a per-cpu
      variable to indicate that this cpu is offline.
      
      [qiang.zhang@windriver.com: fix slab double free when cpu-hotplug]
        Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com
      
      Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com
      Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
      Signed-off-by: Zqiang <qiang.zhang@windriver.com>
      Suggested-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Guangye Yang <guangye.yang@mediatek.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Nicholas Tang <nicholas.tang@mediatek.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Qian Cai <qcai@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
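
      The approach in the commit above (drain the per-cpu quarantine from a
      hotplug callback and flag the cpu as offline) can be sketched in user
      space as follows.  This is an illustration under stated assumptions, not
      the KASAN implementation: every toy_ name, the fixed-size queue and the
      counter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS   4
#define TOY_QUEUE_CAP 8

/* Hypothetical per-cpu quarantine: a small queue plus an offline flag. */
struct toy_cpu_quarantine {
    void  *objs[TOY_QUEUE_CAP];
    size_t count;
    bool   offline;
};

static struct toy_cpu_quarantine toy_quarantine[TOY_NR_CPUS];
static size_t toy_released; /* objects handed back to the allocator */

/* Stand-in for really freeing an object; here it only counts. */
static void toy_release(void *obj)
{
    (void)obj;
    toy_released++;
}

/* Queue a freed object on this cpu. Returns false when the cpu is offline
 * (or the queue is full), in which case the caller frees it directly. */
static bool toy_quarantine_put(int cpu, void *obj)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    struct toy_cpu_quarantine *q = &toy_quarantine[cpu];

    if (q->offline || q->count == TOY_QUEUE_CAP)
        return false;
    q->objs[q->count++] = obj;
    return true;
}

/* Hotplug-style callback: mark the cpu offline first, then drain its queue
 * so no object is left behind for a later cache destroy to trip over. */
static void toy_cpu_offline(int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    struct toy_cpu_quarantine *q = &toy_quarantine[cpu];

    q->offline = true;
    while (q->count > 0)
        toy_release(q->objs[--q->count]);
}

/* Counterpart callback: accept objects again once the cpu is back. */
static void toy_cpu_online(int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    toy_quarantine[cpu].offline = false;
}
```

      Setting the offline flag before draining means that a free arriving
      during the drain is rejected by toy_quarantine_put() rather than queued
      behind the drain.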
    • revert "mm/filemap: add static for function __add_to_page_cache_locked" · 16c0cc0c
      Andrew Morton committed
      Revert commit 3351b16a ("mm/filemap: add static for function
      __add_to_page_cache_locked") due to an incompatibility with
      ALLOW_ERROR_INJECTION which results in build errors.
      
      Link: https://lkml.kernel.org/r/CAADnVQJ6tmzBXvtroBuEH6QA0H+q7yaSKxrVvVxhqr3KBZdEXg@mail.gmail.com
      Tested-by: Justin Forbes <jmforbes@linuxtx.org>
      Tested-by: Greg Thelen <gthelen@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Michal Kubecek <mkubecek@suse.cz>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Tony Luck <tony.luck@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 09 Dec, 2020 1 commit
    • mm/madvise: remove racy mm ownership check · a68a0262
      Minchan Kim committed
      Jann spotted a security hole caused by a race in the mm ownership check.
      
      If the task is sharing the mm_struct but goes through execve() before
      mm_access(), it could skip the process_madvise_behavior_valid check.
      That allows *any advice hint* to reach into the remote process.
      
      This patch removes the mm ownership check.  With it gone, a local
      process loses the ability to give *any* advice hint through the vector
      interface (which it might want for some reason, e.g., performance).
      Since there is no concrete example upstream yet, it is better to remove
      the ability for now and review it again when such new advice comes up.
      
      Fixes: ecb8ac8b ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
      Reported-by: Jann Horn <jannh@google.com>
      Suggested-by: Jann Horn <jannh@google.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 07 Dec, 2020 6 commits
    • mm/mmap.c: fix mmap return value when vma is merged after call_mmap() · 309d08d9
      Liu Zixian committed
      On success, mmap should return the start address of the newly mapped
      area, but patch "mm: mmap: merge vma after call_mmap() if possible"
      sets the return value addr to vm_start of the newly merged vma.  Users
      of mmap will get the wrong address if the vma is merged after
      call_mmap().  We fix this by moving the assignment to addr before
      merging the vma.
      
      We have a driver which changes vm_flags, and this bug was found by our
      testcases.
      
      Fixes: d70cec89 ("mm: mmap: merge vma after call_mmap() if possible")
      Signed-off-by: Liu Zixian <liuzixian4@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hongxiang Lou <louhongxiang@huawei.com>
      Cc: Hu Shiyuan <hushiyuan@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
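
      The ordering problem in the commit above can be illustrated with a toy
      model.  This is a sketch, not mm/mmap.c: struct toy_vma, toy_merge() and
      both toy_mmap_* helpers are hypothetical, and merging is reduced to
      "extend the previous vma when the two ranges are adjacent".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced vma: just an address range. */
struct toy_vma {
    unsigned long vm_start;
    unsigned long vm_end;
};

/* Merge `fresh` into `prev` when they are adjacent; return the surviving vma. */
static struct toy_vma *toy_merge(struct toy_vma *prev, struct toy_vma *fresh)
{
    assert(fresh != NULL);
    if (prev != NULL && prev->vm_end == fresh->vm_start) {
        prev->vm_end = fresh->vm_end;
        return prev;
    }
    return fresh;
}

/* Buggy ordering: the return address is read after the merge, so it is
 * the start of the merged vma rather than of the new mapping. */
static unsigned long toy_mmap_buggy(struct toy_vma *prev, struct toy_vma *fresh)
{
    struct toy_vma *vma = toy_merge(prev, fresh);
    return vma->vm_start;
}

/* Fixed ordering: record the address before any merge can happen. */
static unsigned long toy_mmap_fixed(struct toy_vma *prev, struct toy_vma *fresh)
{
    unsigned long addr = fresh->vm_start;
    (void)toy_merge(prev, fresh);
    return addr;
}
```

      With prev covering 0x1000-0x2000 and a new mapping at 0x2000-0x3000, the
      buggy variant returns 0x1000 and the fixed variant returns 0x2000.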
    • hugetlb_cgroup: fix offline of hugetlb cgroup with reservations · 7a5bde37
      Mike Kravetz committed
      Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
      using hugetlbfs.  In this environment the issue is reproduced by:
      
       - Start a simple pod that uses the recently added HugePages medium
         feature (pod yaml attached)
      
       - Start a DPDK app. It doesn't need to run successfully (as in transfer
         packets) nor interact with real hardware. It seems just initializing
         the EAL layer (which handles hugepage reservation and locking) is
         enough to trigger the issue
      
       - Delete the Pod (or let it "Complete").
      
      This would result in a kworker thread going into a tight loop (top output):
      
         1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45 kworker/28:7+cgroup_destroy
      
      'perf top -g' reports:
      
        -   63.28%     0.01%  [kernel]                    [k] worker_thread
           - 49.97% worker_thread
              - 52.64% process_one_work
                 - 62.08% css_killed_work_fn
                    - hugetlb_cgroup_css_offline
                         41.52% _raw_spin_lock
                       - 2.82% _cond_resched
                            rcu_all_qs
                         2.66% PageHuge
              - 0.57% schedule
                 - 0.57% __schedule
      
      We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
      Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
      infinitely spinning.  Little else can be done on the system as the
      cgroup_mutex can not be acquired.
      
      Do note that the issue can be reproduced by simply offlining a hugetlb
      cgroup containing pages with reservation counts.
      
      The loop in hugetlb_cgroup_css_offline is moving page counts from the
      cgroup being offlined to the parent cgroup.  This is done for each
      hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
      The routine moving counts (hugetlb_cgroup_move_parent) is only moving
      'usage' counts.  The routine hugetlb_cgroup_have_usage is checking for
      both 'usage' and 'reservation' counts.  What to do with reservation
      counts when reparenting was discussed here:
      
      https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/
      
      The decision was made to leave a zombie cgroup for cgroups with
      reservation counts.  Unfortunately, the code checking reservation
      counts was incorrectly added to hugetlb_cgroup_have_usage.
      
      To fix the issue, simply remove the check for reservation counts.  While
      fixing this issue, a related bug in hugetlb_cgroup_css_offline was
      noticed.  The hstate index is not reinitialized each time through the
      do-while loop.  Fix this as well.
      
      Fixes: 1adc4d41 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
      Reported-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Adrian Moreno <amorenoz@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
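
      The spinning loop described in the commit above can be reproduced in a
      small model.  This is a sketch under assumptions, not the hugetlb_cgroup
      code: the toy_ types, the two predicates and the max_passes cut-off
      (which stands in for "would spin forever") are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_HSTATES 2

/* Hypothetical per-cgroup counters, one pair per hstate. */
struct toy_hcg {
    long usage[TOY_NR_HSTATES];
    long rsvd[TOY_NR_HSTATES];
};

/* Fixed predicate: only 'usage' counts keep the offline loop going. */
static bool toy_have_usage(const struct toy_hcg *cg)
{
    for (int idx = 0; idx < TOY_NR_HSTATES; idx++)
        if (cg->usage[idx] > 0)
            return true;
    return false;
}

/* Pre-fix predicate: also looks at 'reservation' counts, which the
 * loop below never moves to the parent. */
static bool toy_have_usage_or_rsvd(const struct toy_hcg *cg)
{
    for (int idx = 0; idx < TOY_NR_HSTATES; idx++)
        if (cg->usage[idx] > 0 || cg->rsvd[idx] > 0)
            return true;
    return false;
}

/* Move usage counts to the parent until `busy` reports nothing left.
 * Returns the number of passes, or -1 once max_passes is reached. */
static int toy_offline(struct toy_hcg *cg, struct toy_hcg *parent,
                       bool (*busy)(const struct toy_hcg *), int max_passes)
{
    int passes = 0;
    int idx;

    assert(cg != NULL && parent != NULL && busy != NULL);
    do {
        if (passes == max_passes)
            return -1;
        idx = 0; /* reinitialised on every pass through the loop */
        for (; idx < TOY_NR_HSTATES; idx++) {
            parent->usage[idx] += cg->usage[idx];
            cg->usage[idx] = 0;
        }
        passes++;
    } while (busy(cg));
    return passes;
}
```

      A cgroup with leftover reservation counts never satisfies the pre-fix
      predicate, so toy_offline() hits the cut-off; with the fixed predicate
      it finishes after one pass and the reservation counts stay behind.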
    • mm/filemap: add static for function __add_to_page_cache_locked · 3351b16a
      Alex Shi committed
        mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapfile: do not sleep with a spin lock held · b11a76b3
      Qian Cai committed
      We can't call kvfree() with a spin lock held, so defer it.  Fixes a
      might_sleep() runtime warning.
      
      Fixes: 873d7bcf ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
      Signed-off-by: Qian Cai <qcai@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
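
      The "defer the free until after unlock" pattern named in the commit
      above might look like this in a single-threaded user-space model.  It is
      a sketch, not the swapfile code: the flag-based lock, toy_sleeping_free()
      and toy_claim_slot() are hypothetical, and free() stands in for a free
      routine that may sleep.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy single-threaded "spin lock": just a flag, so the sketch can detect
 * a free that happens while the lock is held. */
static int toy_lock_held;
static int toy_frees_under_lock;

static void toy_spin_lock(void)   { assert(!toy_lock_held); toy_lock_held = 1; }
static void toy_spin_unlock(void) { assert(toy_lock_held);  toy_lock_held = 0; }

/* Stand-in for a free routine that may sleep. */
static void toy_sleeping_free(void *p)
{
    if (toy_lock_held)
        toy_frees_under_lock++;
    free(p);
}

/* Preallocate, then decide under the lock whether the buffer is needed.
 * An unneeded buffer is only remembered in `defer` and freed after unlock. */
static void *toy_claim_slot(void **slot)
{
    assert(slot != NULL);
    void *fresh = malloc(64);
    void *defer = NULL;
    void *result;

    if (fresh == NULL)
        return NULL;

    toy_spin_lock();
    if (*slot == NULL)
        *slot = fresh;      /* first caller installs its buffer */
    else
        defer = fresh;      /* slot already populated: buffer unused */
    result = *slot;
    toy_spin_unlock();

    toy_sleeping_free(defer); /* free(NULL) is a no-op */
    return result;
}
```

      Calling toy_claim_slot() twice on the same slot returns the same buffer
      both times, and toy_frees_under_lock stays at zero because the unused
      second allocation is released only after the unlock.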
    • mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING · e91d8d78
      Minchan Kim committed
      While doing zram testing, I found that decompression sometimes failed
      because the compression buffer was corrupted.  On investigation, I
      found that the commit below calls cond_resched() unconditionally, which
      can cause a problem in atomic context if the task is rescheduled.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:108
        in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
        3 locks held by memhog/946:
         #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
         #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
         #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
        CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
        Call Trace:
          unmap_kernel_range_noflush+0x2eb/0x350
          unmap_kernel_range+0x14/0x30
          zs_unmap_object+0xd5/0xe0
          zram_bvec_rw.isra.0+0x38c/0x8e0
          zram_rw_page+0x90/0x101
          bdev_write_page+0x92/0xe0
          __swap_writepage+0x94/0x4a0
          pageout+0xe3/0x3a0
          shrink_page_list+0xb94/0xd60
          shrink_inactive_list+0x158/0x460
      
      We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
      contains the offending calling code) from zsmalloc.
      
      Even though this option showed some improvement (e.g., 30%) on some
      arm32 platforms, it has been a headache to maintain since it has abused
      APIs[1] (e.g., unmap_kernel_range in atomic context).
      
      Since we are moving towards deprecating 32bit machines, the config
      option has been available only for builtin builds since v5.8, and it
      is not the default option in zsmalloc, it's time to drop the option
      for better maintenance.
      
      [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
      
      Fixes: e47110e9 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Harish Sriram <harish@linux.ibm.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: list_lru: set shrinker map bit when child nr_items is not zero · 8199be00
      Yang Shi committed
      When investigating a slab cache bloat problem, a significant amount of
      negative dentry cache was seen, but confusingly it was neither shrunk
      by the reclaimer (the host has very tight memory) nor by dropping
      caches.  The vmcore shows there are over 14M negative dentry objects on
      the lru, but tracing results show they were not even scanned.
      
      Further investigation shows the memcg's vfs shrinker_map bit is not set,
      so the reclaimer and cache dropping just skip calling the vfs shrinker.
      We have to reboot the hosts to get the memory back.
      
      I didn't manage to come up with a reproducer in a test environment, and
      the problem can't be reproduced after rebooting.  But code inspection
      suggests there is a race between the shrinker map bit clear and
      reparenting.  The hypothesis is elaborated below.
      
      The memcg hierarchy on our production environment looks like:
      
                      root
                     /    \
                system   user
      
      The main workloads are running under the user slice's children, and
      they create and remove memcgs frequently.  So reparenting happens very
      often under the user slice, but no task is under the user slice directly.
      
      So with the frequent reparenting and tight memory pressure, the below
      hypothetical race condition may happen:
      
             CPU A                            CPU B
      reparent
          dst->nr_items == 0
                                       shrinker:
                                           total_objects == 0
          add src->nr_items to dst
          set_bit
                                           return SHRINK_EMPTY
                                           clear_bit
      child memcg offline
          replace child's kmemcg_id with
          parent's (in memcg_offline_kmem())
                                        list_lru_del() between shrinker runs
                                           see parent's kmemcg_id
                                           dec dst->nr_items
      reparent again
          dst->nr_items may go negative
          due to concurrent list_lru_del()
      
                                       The second run of shrinker:
                                           read nr_items without any
                                           synchronization, so it may
                                           see intermediate negative
                                           nr_items then total_objects
                                           may return 0 coincidently
      
                                           keep the bit cleared
          dst->nr_items != 0
          skip set_bit
          add src->nr_items to dst
      
      After this point dst->nr_items may never go to zero, so reparenting will
      not set the shrinker_map bit anymore.  And since there is no task under
      the user slice directly, no new object will be added to its lru to set
      the shrinker map bit either.  That bit is kept cleared forever.
      
      How does list_lru_del() race with reparenting? Reparenting replaces the
      children's kmemcg_id with the parent's without the protection of
      nlru->lock, so list_lru_del() may see the parent's kmemcg_id while
      actually deleting items from the child's lru.  It then decrements the
      parent's nr_items, so the parent's nr_items may go negative, as commit
      2788cf0c ("memcg: reparent list_lrus and free kmemcg_id on css
      offline") says.
      
      Since it is impossible for dst->nr_items to go negative and
      src->nr_items to go to zero at the same time, it seems we could set the
      shrinker map bit iff src->nr_items != 0.  We could synchronize
      list_lru_count_one() and reparenting with nlru->lock, but checking
      src->nr_items in reparenting seems the simplest approach and avoids
      lock contention.
      
      Fixes: fae91d6d ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
      Suggested-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.19]
      Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
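
      The rule change in the commit above (set the shrinker map bit iff
      src->nr_items != 0) can be compared with a dst-based rule in a toy
      model.  This is a sketch, not mm/list_lru.c: struct toy_lru_one, both
      helpers and the plain bool standing in for the parent's shrinker map bit
      are hypothetical, and the "old" rule is an assumption that reproduces
      the "dst->nr_items != 0, skip set_bit" step of the race diagram.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-memcg lru: only the item counter matters here. */
struct toy_lru_one {
    long nr_items;
};

/* Assumed pre-fix rule: set the bit only when dst looked empty before
 * the move, so a transiently negative dst->nr_items skips set_bit. */
static void toy_reparent_old(struct toy_lru_one *src, struct toy_lru_one *dst,
                             bool *parent_bit)
{
    assert(src != NULL && dst != NULL && parent_bit != NULL);
    bool set = (dst->nr_items == 0 && src->nr_items != 0);

    dst->nr_items += src->nr_items;
    if (set)
        *parent_bit = true;
    src->nr_items = 0;
}

/* Rule from the commit message: set the bit iff src->nr_items != 0,
 * regardless of what dst->nr_items currently reads. */
static void toy_reparent(struct toy_lru_one *src, struct toy_lru_one *dst,
                         bool *parent_bit)
{
    assert(src != NULL && dst != NULL && parent_bit != NULL);
    if (src->nr_items != 0) {
        dst->nr_items += src->nr_items;
        *parent_bit = true;
        src->nr_items = 0;
    }
}
```

      Starting from dst->nr_items == -1 (the transient negative value from the
      race) and src->nr_items == 5, the dst-based rule leaves the bit cleared
      while the src-based rule sets it; both end with dst->nr_items == 4.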