1. 20 Sep 2020, 3 commits
  2. 18 Sep 2020, 2 commits
    • percpu: fix first chunk size calculation for populated bitmap · b3b33d3c
      Authored by Sunghyun Jin
      The variable populated, a member of struct pcpu_chunk, is used in
      units of unsigned long.  However, the size of populated is
      miscounted, so fix this minor issue.
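
      A minimal userspace sketch of the sizing rule in question, assuming
      BITS_TO_LONGS() semantics as in the kernel; the names are
      illustrative, not the patched code itself:

        #include <stdio.h>

        /* Round a bit count up to whole unsigned longs, as the kernel's
         * BITS_TO_LONGS() does. */
        #define BITS_PER_LONG (8 * sizeof(unsigned long))
        #define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

        /* Byte size of a populated-pages bitmap stored as unsigned longs:
         * the number of longs must be scaled by sizeof(unsigned long),
         * which is exactly the kind of unit mismatch this commit fixes. */
        static unsigned long populated_bitmap_bytes(unsigned long nr_pages)
        {
                return BITS_TO_LONGS(nr_pages) * sizeof(unsigned long);
        }

        int main(void)
        {
                printf("%lu\n", populated_bitmap_bytes(100)); /* 16 on 64-bit */
                return 0;
        }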
      
      Fixes: 8ab16c43 ("percpu: change the number of pages marked in the first_chunk pop bitmap")
      Cc: <stable@vger.kernel.org> # 4.14+
      Signed-off-by: Sunghyun Jin <mcsmonk@gmail.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
    • mm: allow a controlled amount of unfairness in the page lock · 5ef64cc8
      Authored by Linus Torvalds
      Commit 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic") made
      the page locking entirely fair, in that if a waiter came in while the
      lock was held, the lock would be transferred to the lockers strictly in
      order.
      
      That was intended to finally get rid of the long-reported watchdog
      failures that involved the page lock under extreme load, where a process
      could end up waiting essentially forever, as other page lockers stole
      the lock from under it.
      
      It also improved some benchmarks, but it ended up causing huge
      performance regressions on others, simply because fair lock behavior
      doesn't end up giving out the lock as aggressively, causing better
      worst-case latency, but potentially much worse average latencies and
      throughput.
      
      Instead of reverting that change entirely, this introduces a controlled
      amount of unfairness, with a sysctl knob to tune it if somebody needs
      to.  But the default value should hopefully be good for any normal load,
      allowing a few rounds of lock stealing, but enforcing the strict
      ordering before the lock has been stolen too many times.
      
      There is also a hint from Matthieu Baerts that the fair page coloring
      may end up exposing an ABBA deadlock that is hidden by the usual
      optimistic lock stealing, and while the unfairness doesn't fix the
      fundamental issue (and I'm still looking at that), it avoids it in
      practice.
      
      The amount of unfairness can be modified by writing a new value to the
      'sysctl_page_lock_unfairness' variable (default value of 5, exposed
      through /proc/sys/vm/page_lock_unfairness), but that is hopefully
      something we'd use mainly for debugging rather than being necessary for
      any deep system tuning.
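
      A small userspace sketch for inspecting the knob; only the procfs
      path is taken from the commit, the program itself is illustrative:

        #include <stdio.h>

        int main(void)
        {
                /* Read the current unfairness setting; writing a new
                 * value works the same way but requires root. */
                FILE *f = fopen("/proc/sys/vm/page_lock_unfairness", "r");
                int val;

                if (!f || fscanf(f, "%d", &val) != 1)
                        return 1;
                fclose(f);
                printf("page_lock_unfairness = %d\n", val);
                return 0;
        }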
      
      This whole issue has exposed just how critical the page lock can be, and
      how contended it gets under certain loads.  And the main contention
      doesn't really seem to be anything related to IO (which was the origin
      of this lock), but for things like just verifying that the page file
      mapping is stable while faulting in the page into a page table.
      
      Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
      Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
      Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
      Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
      Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 06 Sep 2020, 12 commits
    • mm/khugepaged.c: fix khugepaged's request size in collapse_file · e5a59d30
      Authored by David Howells
      collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to
      be read to page_cache_sync_readahead().  The intent was probably to read
      a single page.  Fix it to use the number of pages to the end of the
      window instead.
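
      The corrected call, approximately (kernel context, not standalone;
      'index' and 'end' are the window's current and one-past-last page
      indices):

        /* Read ahead the pages remaining in the collapse window rather
         * than passing PAGE_SIZE as a page count. */
        page_cache_sync_readahead(mapping, &file->f_ra, file,
                                  index, end - index);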
      
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-2-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: fix a race between hugetlb sysctl handlers · 17743798
      Authored by Muchun Song
      There is a race between the assignment of `table->data` and the
      write of a value through that pointer in
      __do_proc_doulongvec_minmax() on another thread.
      
        CPU0:                                 CPU1:
                                              proc_sys_write
        hugetlb_sysctl_handler                  proc_sys_call_handler
        hugetlb_sysctl_handler_common             hugetlb_sysctl_handler
          table->data = &tmp;                       hugetlb_sysctl_handler_common
                                                      table->data = &tmp;
            proc_doulongvec_minmax
              do_proc_doulongvec_minmax           sysctl_head_finish
                __do_proc_doulongvec_minmax         unuse_table
                  i = table->data;
                  *i = val;  // corrupt CPU1's stack
      
      Fix this by duplicating the `table` and updating only the duplicate.
      Also introduce a helper, proc_hugetlb_doulongvec_minmax(), to
      simplify the code, as sketched below.
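
      A compilable sketch of the pattern, with a simplified stand-in for
      struct ctl_table; the real helper added by the commit is
      proc_hugetlb_doulongvec_minmax():

        #include <stddef.h>

        /* Simplified stand-in for the kernel's struct ctl_table. */
        struct ctl_table_stub {
                void *data;
                size_t maxlen;
        };

        /* Work on a stack-local duplicate so concurrent writers never
         * store through the shared table's ->data pointer. */
        static int proc_doulongvec_minmax_dup(const struct ctl_table_stub *table,
                                              unsigned long *tmp)
        {
                struct ctl_table_stub dup_table = *table;

                dup_table.data = tmp;        /* only the copy is updated */
                dup_table.maxlen = sizeof(*tmp);
                /* ... pass &dup_table, never 'table', to the parser ... */
                return 0;
        }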
      
      The following oops was seen:
      
          BUG: kernel NULL pointer dereference, address: 0000000000000000
          #PF: supervisor instruction fetch in kernel mode
          #PF: error_code(0x0010) - not-present page
          Code: Bad RIP value.
          ...
          Call Trace:
           ? set_max_huge_pages+0x3da/0x4f0
           ? alloc_pool_huge_page+0x150/0x150
           ? proc_doulongvec_minmax+0x46/0x60
           ? hugetlb_sysctl_handler_common+0x1c7/0x200
           ? nr_hugepages_store+0x20/0x20
           ? copy_fd_bitmaps+0x170/0x170
           ? hugetlb_sysctl_handler+0x1e/0x20
           ? proc_sys_call_handler+0x2f1/0x300
           ? unregister_sysctl_table+0xb0/0xb0
           ? __fd_install+0x78/0x100
           ? proc_sys_write+0x14/0x20
           ? __vfs_write+0x4d/0x90
           ? vfs_write+0xef/0x240
           ? ksys_write+0xc0/0x160
           ? __ia32_sys_read+0x50/0x50
           ? __close_fd+0x129/0x150
           ? __x64_sys_write+0x43/0x50
           ? do_syscall_64+0x6c/0x200
           ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: e5ff2159 ("hugetlb: multiple hstates for multiple page sizes")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200828031146.43035-1-songmuchun@bytedance.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: try preferred node first when alloc gigantic page from cma · 953f064a
      Authored by Li Xinhai
      Since commit cf11e85f ("mm: hugetlb: optionally allocate gigantic
      hugepages using cma"), the gigantic page could be allocated from a
      node other than the preferred node, even though pages were available
      on that node.  The reason is that the nid parameter was ignored in
      alloc_gigantic_page().
      
      Besides, __GFP_THISNODE also needs to be checked if the user
      requires allocation only from the preferred node.
      
      After this patch, the preferred node is tried first before other
      allowed nodes, and no attempt is made to allocate from other nodes
      when __GFP_THISNODE is specified.  If the user doesn't specify a
      preferred node, the current node is used, which ensures consistent
      behavior between allocating gigantic and non-gigantic hugetlb pages.
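
      A runnable userspace model of the allocation order after the fix;
      every name below is an illustrative stand-in, not a kernel API:

        #include <stdbool.h>
        #include <stdio.h>

        #define GFP_THISNODE 0x1u   /* stand-in for __GFP_THISNODE */
        #define MAX_NODES    8

        /* Hypothetical per-node CMA allocation attempt. */
        static bool try_alloc_on_node(int nid) { return nid == 2; }

        static int alloc_gigantic_model(int preferred_nid,
                                        unsigned int allowed,
                                        unsigned int gfp)
        {
                int nid;

                /* Try the preferred node first. */
                if (try_alloc_on_node(preferred_nid))
                        return preferred_nid;
                /* Caller demanded the preferred node only: no fallback. */
                if (gfp & GFP_THISNODE)
                        return -1;
                /* Fall back to the other allowed nodes. */
                for (nid = 0; nid < MAX_NODES; nid++)
                        if (nid != preferred_nid &&
                            (allowed & (1u << nid)) &&
                            try_alloc_on_node(nid))
                                return nid;
                return -1;
        }

        int main(void)
        {
                printf("%d\n", alloc_gigantic_model(0, 0xff, 0));   /* 2 */
                printf("%d\n", alloc_gigantic_model(0, 0xff,
                                                    GFP_THISNODE)); /* -1 */
                return 0;
        }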
      
      Fixes: cf11e85f ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
      Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: https://lkml.kernel.org/r/20200902025016.697260-1-lixinhai.lxh@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: preserve soft dirty in remove_migration_pte() · 3d321bf8
      Authored by Ralph Campbell
      The code to remove a migration PTE and replace it with a device private
      PTE was not copying the soft dirty bit from the migration entry.  This
      could lead to page contents not being marked dirty when faulting the page
      back from device private memory.
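
      The added lines, approximately (kernel context; pvmw is the
      page-vma walk state in remove_migration_pte()):

        /* Carry the soft-dirty bit from the migration entry over to
         * the new device private swap pte. */
        if (pte_swp_soft_dirty(*pvmw.pte))
                pte = pte_swp_mksoft_dirty(pte);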
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Link: https://lkml.kernel.org/r/20200831212222.22409-3-rcampbell@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: remove unnecessary is_zone_device_page() check · 6128763f
      Authored by Ralph Campbell
      Patch series "mm/migrate: preserve soft dirty in remove_migration_pte()".
      
      I happened to notice this from code inspection after seeing Alistair
      Popple's patch ("mm/rmap: Fixup copying of soft dirty and uffd ptes").
      
      This patch (of 2):
      
      The check for is_zone_device_page() and is_device_private_page() is
      unnecessary since the latter is sufficient to determine if the page is a
      device private page.  Simplify the code for easier reading.
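
      The simplified test, approximately (a sketch of the kernel change;
      is_device_private_page() already implies a zone-device page, so the
      outer is_zone_device_page() check can be dropped):

        if (unlikely(is_device_private_page(new))) {
                entry = make_device_private_entry(new, pte_write(pte));
                pte = swp_entry_to_pte(entry);
        }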
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Link: https://lkml.kernel.org/r/20200831212222.22409-1-rcampbell@nvidia.com
      Link: https://lkml.kernel.org/r/20200831212222.22409-2-rcampbell@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/rmap: fixup copying of soft dirty and uffd ptes · ad7df764
      Authored by Alistair Popple
      During memory migration a pte is temporarily replaced with a migration
      swap pte.  Some pte bits from the existing mapping such as the soft-dirty
      and uffd write-protect bits are preserved by copying these to the
      temporary migration swap pte.
      
      However these bits are not stored at the same location for swap and
      non-swap ptes.  Therefore testing these bits requires using the
      appropriate helper function for the given pte type.
      
      Unfortunately several code locations were found where the wrong helper
      function is being used to test soft_dirty and uffd_wp bits which leads to
      them getting incorrectly set or cleared during page-migration.
      
      Fix these by using the correct tests based on pte type.
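
      The rule the fix enforces, approximately (real kernel helpers, but
      the surrounding logic is abbreviated): present ptes and swap ptes
      each get their own soft-dirty/uffd-wp test:

        if (pte_present(pte)) {
                if (pte_soft_dirty(pte))
                        swp_pte = pte_swp_mksoft_dirty(swp_pte);
                if (pte_uffd_wp(pte))
                        swp_pte = pte_swp_mkuffd_wp(swp_pte);
        } else {
                /* Already a swap pte: use the swap variants. */
                if (pte_swp_soft_dirty(pte))
                        swp_pte = pte_swp_mksoft_dirty(swp_pte);
                if (pte_swp_uffd_wp(pte))
                        swp_pte = pte_swp_mkuffd_wp(swp_pte);
        }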
      
      Fixes: a5430dda ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
      Fixes: 8c3328f1 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
      Fixes: f45ec5ff ("userfaultfd: wp: support swap and page migration")
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: fixup setting UFFD_WP flag · ebdf8321
      Authored by Alistair Popple
      Commit f45ec5ff ("userfaultfd: wp: support swap and page migration")
      introduced support for tracking the uffd wp bit during page migration.
      However the non-swap PTE variant was used to set the flag for zone device
      private pages which are a type of swap page.
      
      This leads to corruption of the swap offset if the original PTE has the
      uffd_wp flag set.
      
      Fixes: f45ec5ff ("userfaultfd: wp: support swap and page migration")
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Link: https://lkml.kernel.org/r/20200825064232.10023-1-alistair@popple.id.au
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: madvise: fix vma use-after-free · 7867fd7c
      Authored by Yang Shi
      The syzbot reported the below use-after-free:
      
        BUG: KASAN: use-after-free in madvise_willneed mm/madvise.c:293 [inline]
        BUG: KASAN: use-after-free in madvise_vma mm/madvise.c:942 [inline]
        BUG: KASAN: use-after-free in do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
        Read of size 8 at addr ffff8880a6163eb0 by task syz-executor.0/9996
      
        CPU: 0 PID: 9996 Comm: syz-executor.0 Not tainted 5.9.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x18f/0x20d lib/dump_stack.c:118
          print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
          __kasan_report mm/kasan/report.c:513 [inline]
          kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
          madvise_willneed mm/madvise.c:293 [inline]
          madvise_vma mm/madvise.c:942 [inline]
          do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
          do_madvise mm/madvise.c:1169 [inline]
          __do_sys_madvise mm/madvise.c:1171 [inline]
          __se_sys_madvise mm/madvise.c:1169 [inline]
          __x64_sys_madvise+0xd9/0x110 mm/madvise.c:1169
          do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Allocated by task 9992:
          kmem_cache_alloc+0x138/0x3a0 mm/slab.c:3482
          vm_area_alloc+0x1c/0x110 kernel/fork.c:347
          mmap_region+0x8e5/0x1780 mm/mmap.c:1743
          do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
          vm_mmap_pgoff+0x195/0x200 mm/util.c:506
          ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
          do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 9992:
          kmem_cache_free.part.0+0x67/0x1f0 mm/slab.c:3693
          remove_vma+0x132/0x170 mm/mmap.c:184
          remove_vma_list mm/mmap.c:2613 [inline]
          __do_munmap+0x743/0x1170 mm/mmap.c:2869
          do_munmap mm/mmap.c:2877 [inline]
          mmap_region+0x257/0x1780 mm/mmap.c:1716
          do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
          vm_mmap_pgoff+0x195/0x200 mm/util.c:506
          ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
          do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens because the vma is accessed after releasing mmap_lock:
      someone else can acquire the mmap_lock in the meantime, and the vma
      is gone.
      
      Releasing mmap_lock only after the last access to the vma fixes the
      problem, as sketched below.
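
      The fix pattern, approximately (from madvise_willneed(): take a
      file reference and compute everything derived from the vma before
      dropping the lock; 'offset' is assumed declared as loff_t):

        get_file(file);            /* keep the file alive across unlock */
        offset = (loff_t)(start - vma->vm_start)
                        + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
        mmap_read_unlock(current->mm);  /* vma is off-limits after this */
        vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
        fput(file);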
      
      Fixes: 692fe624 ("mm: Handle MADV_WILLNEED through vfs_fadvise()")
      Reported-by: syzbot+b90df26038d1d5d85c97@syzkaller.appspotmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>	[5.4+]
      Link: https://lkml.kernel.org/r/20200816141204.162624-1-shy828301@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: track page table modifications in __apply_to_page_range() · e80d3909
      Authored by Joerg Roedel
      __apply_to_page_range() is also used to change and/or allocate
      page-table pages in the vmalloc area of the address space.  Make sure
      these changes get synchronized to other page-tables in the system by
      calling arch_sync_kernel_mappings() when necessary.
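
      The synchronization added, approximately (the same pattern vmalloc
      uses; 'mask' accumulates the modified page-table levels while the
      range is walked):

        /* After walking [addr, addr + size), publish any page-table
         * level changes to the other page tables in the system. */
        if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
                arch_sync_kernel_mappings(addr, addr + size);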
      
      The impact appears limited to x86-32, where apply_to_page_range may miss
      updating the PMD.  That leads to explosions in drivers like
      
        BUG: unable to handle page fault for address: fe036000
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        *pde = 00000000
        Oops: 0002 [#1] SMP
        CPU: 3 PID: 1300 Comm: gem_concurrent_ Not tainted 5.9.0-rc1+ #16
        Hardware name:  /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015
        EIP: __execlists_context_alloc+0x132/0x2d0 [i915]
        Code: 31 d2 89 f0 e8 2f 55 02 00 89 45 e8 3d 00 f0 ff ff 0f 87 11 01 00 00 8b 4d e8 03 4b 30 b8 5a 5a 5a 5a ba 01 00 00 00 8d 79 04 <c7> 01 5a 5a 5a 5a c7 81 fc 0f 00 00 5a 5a 5a 5a 83 e7 fc 29 f9 81
        EAX: 5a5a5a5a EBX: f60ca000 ECX: fe036000 EDX: 00000001
        ESI: f43b7340 EDI: fe036004 EBP: f6389cb8 ESP: f6389c9c
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
        CR0: 80050033 CR2: fe036000 CR3: 2d361000 CR4: 001506d0
        DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
        DR6: fffe0ff0 DR7: 00000400
        Call Trace:
          execlists_context_alloc+0x10/0x20 [i915]
          intel_context_alloc_state+0x3f/0x70 [i915]
          __intel_context_do_pin+0x117/0x170 [i915]
          i915_gem_do_execbuffer+0xcc7/0x2500 [i915]
          i915_gem_execbuffer2_ioctl+0xcd/0x1f0 [i915]
          drm_ioctl_kernel+0x8f/0xd0
          drm_ioctl+0x223/0x3d0
          __ia32_sys_ioctl+0x1ab/0x760
          __do_fast_syscall_32+0x3f/0x70
          do_fast_syscall_32+0x29/0x60
          do_SYSENTER_32+0x15/0x20
          entry_SYSENTER_32+0x9f/0xf2
        EIP: 0xb7f28559
        Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
        EAX: ffffffda EBX: 00000005 ECX: c0406469 EDX: bf95556c
        ESI: b7e68000 EDI: c0406469 EBP: 00000005 ESP: bf9554d8
        DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000296
        Modules linked in: i915 x86_pkg_temp_thermal intel_powerclamp crc32_pclmul crc32c_intel intel_cstate intel_uncore intel_gtt drm_kms_helper intel_pch_thermal video button autofs4 i2c_i801 i2c_smbus fan
        CR2: 00000000fe036000
      
      It looks like kasan, xen and i915 are vulnerable.
      
      Actual impact is "on thinkpad X60 in 5.9-rc1, screen starts blinking
      after 30-or-so minutes, and machine is unusable"
      
      [sfr@canb.auug.org.au: ARCH_PAGE_TABLE_SYNC_MASK needs vmalloc.h]
        Link: https://lkml.kernel.org/r/20200825172508.16800a4f@canb.auug.org.au
      [chris@chris-wilson.co.uk: changelog addition]
      [pavel@ucw.cz: changelog addition]
      
      Fixes: 2ba3e694 ("mm/vmalloc: track which page-table levels were modified")
      Fixes: 86cf69f1 ("x86/mm/32: implement arch_sync_kernel_mappings()")
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Chris Wilson <chris@chris-wilson.co.uk>	[x86-32]
      Tested-by: Pavel Machek <pavel@ucw.cz>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org>	[5.8+]
      Link: https://lkml.kernel.org/r/20200821123746.16904-1-joro@8bytes.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: slub: fix conversion of freelist_corrupted() · dc07a728
      Authored by Eugeniu Rosca
      Commit 52f23478 ("mm/slub.c: fix corrupted freechain in
      deactivate_slab()") suffered an update when picked up from LKML [1].
      
      Specifically, relocating 'freelist = NULL' into 'freelist_corrupted()'
      created a no-op statement.  Fix it by sticking to the behavior intended
      in the original patch [1].  In addition, make freelist_corrupted()
      immune to passing NULL instead of &freelist.
      
      The issue has been spotted via static analysis and code review.
      
      [1] https://lore.kernel.org/linux-mm/20200331031450.12182-1-dongli.zhang@oracle.com/
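
      A sketch of the corrected helper (abridged from mm/slub.c; the key
      points are clearing the freelist through the passed pointer and
      tolerating a NULL pointer):

        static inline bool freelist_corrupted(struct kmem_cache *s,
                                              struct page *page,
                                              void **freelist,
                                              void *nextfree)
        {
                if ((s->flags & SLAB_CONSISTENCY_CHECKS) &&
                    !check_valid_pointer(s, page, nextfree) && freelist) {
                        object_err(s, page, *freelist, "Freechain corrupt");
                        *freelist = NULL;  /* the statement the pick-up lost */
                        slab_fix(s, "Isolate corrupted freechain");
                        return true;
                }
                return false;
        }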
      
      Fixes: 52f23478 ("mm/slub.c: fix corrupted freechain in deactivate_slab()")
      Signed-off-by: Eugeniu Rosca <erosca@de.adit-jv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: Joe Jin <joe.jin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200824130643.10291-1-erosca@de.adit-jv.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: fix memcg reclaim soft lockup · e3336cab
      Authored by Xunlei Pang
      We've hit a soft lockup with CONFIG_PREEMPT_NONE=y when the target
      memcg doesn't have any reclaimable memory.
      
      It can be easily reproduced as below:
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 111s! [memcg_test:2204]
        CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
        Call Trace:
          shrink_lruvec+0x49f/0x640
          shrink_node+0x2a6/0x6f0
          do_try_to_free_pages+0xe9/0x3e0
          try_to_free_mem_cgroup_pages+0xef/0x1f0
          try_charge+0x2c1/0x750
          mem_cgroup_charge+0xd7/0x240
          __add_to_page_cache_locked+0x2fd/0x370
          add_to_page_cache_lru+0x4a/0xc0
          pagecache_get_page+0x10b/0x2f0
          filemap_fault+0x661/0xad0
          ext4_filemap_fault+0x2c/0x40
          __do_fault+0x4d/0xf9
          handle_mm_fault+0x1080/0x1790
      
      It only happens on our 1-vcpu instances, because there's no chance
      for the oom reaper to run to reclaim the to-be-killed process's
      memory.
      
      Add a cond_resched() in the upper-level shrink_node_memcgs() to
      solve this issue.  This gives us a scheduling point for each memcg
      in the reclaimed hierarchy, without any dependency on the
      reclaimable memory in that memcg, making the behavior more
      predictable.
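
      The change, approximately (in the memcg iteration loop of
      shrink_node_memcgs()):

        memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
        do {
                /* One scheduling point per memcg, independent of
                 * whether the memcg has any reclaimable memory. */
                cond_resched();
                /* ... shrink the lruvec and slab for this memcg ... */
        } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));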
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlpang@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix use-after-free in uncharge_batch · f1796544
      Authored by Michal Hocko
      syzbot has reported a use-after-free in the uncharge_batch path
      
        BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
        BUG: KASAN: use-after-free in atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
        BUG: KASAN: use-after-free in atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
        BUG: KASAN: use-after-free in page_counter_cancel mm/page_counter.c:54 [inline]
        BUG: KASAN: use-after-free in page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
        Write of size 8 at addr ffff8880371c0148 by task syz-executor.0/9304
      
        CPU: 0 PID: 9304 Comm: syz-executor.0 Not tainted 5.8.0-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x1f0/0x31e lib/dump_stack.c:118
          print_address_description+0x66/0x620 mm/kasan/report.c:383
          __kasan_report mm/kasan/report.c:513 [inline]
          kasan_report+0x132/0x1d0 mm/kasan/report.c:530
          check_memory_region_inline mm/kasan/generic.c:183 [inline]
          check_memory_region+0x2b5/0x2f0 mm/kasan/generic.c:192
          instrument_atomic_write include/linux/instrumented.h:71 [inline]
          atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
          atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
          page_counter_cancel mm/page_counter.c:54 [inline]
          page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
          uncharge_batch+0x6c/0x350 mm/memcontrol.c:6764
          uncharge_page+0x115/0x430 mm/memcontrol.c:6796
          uncharge_list mm/memcontrol.c:6835 [inline]
          mem_cgroup_uncharge_list+0x70/0xe0 mm/memcontrol.c:6877
          release_pages+0x13a2/0x1550 mm/swap.c:911
          tlb_batch_pages_flush mm/mmu_gather.c:49 [inline]
          tlb_flush_mmu_free mm/mmu_gather.c:242 [inline]
          tlb_flush_mmu+0x780/0x910 mm/mmu_gather.c:249
          tlb_finish_mmu+0xcb/0x200 mm/mmu_gather.c:328
          exit_mmap+0x296/0x550 mm/mmap.c:3185
          __mmput+0x113/0x370 kernel/fork.c:1076
          exit_mm+0x4cd/0x550 kernel/exit.c:483
          do_exit+0x576/0x1f20 kernel/exit.c:793
          do_group_exit+0x161/0x2d0 kernel/exit.c:903
          get_signal+0x139b/0x1d30 kernel/signal.c:2743
          arch_do_signal+0x33/0x610 arch/x86/kernel/signal.c:811
          exit_to_user_mode_loop kernel/entry/common.c:135 [inline]
          exit_to_user_mode_prepare+0x8d/0x1b0 kernel/entry/common.c:166
          syscall_exit_to_user_mode+0x5e/0x1a0 kernel/entry/common.c:241
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Commit 1a3e1f40 ("mm: memcontrol: decouple reference counting from
      page accounting") reworked the memcg lifetime to be bound to the
      struct page rather than to charges.  It also removed the
      css_put_many() from uncharge_batch(), and that is causing the above
      splat.
      
      uncharge_batch() is supposed to uncharge accumulated charges for all
      pages freed from the same memcg.  The queuing is done by uncharge_page
      which however drops the memcg reference after it adds charges to the
      batch.  If the current page happens to be the last one holding the
      reference for its memcg then the memcg is OK to go and the next page to
      be freed will trigger batched uncharge which needs to access the memcg
      which is gone already.
      
      Fix the issue by taking a reference for the memcg in the current batch.
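
      The fix, approximately: the batch pins its memcg with a css
      reference while charges accumulate:

        /* In uncharge_page(), when a page starts a new batch: */
        ug->memcg = page->mem_cgroup;
        /* pairs with css_put in uncharge_batch() */
        css_get(&ug->memcg->css);

        /* In uncharge_batch(), after the counters are uncharged: */
        css_put(&ug->memcg->css);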
      
      Fixes: 1a3e1f40 ("mm: memcontrol: decouple reference counting from page accounting")
      Reported-by: syzbot+b305848212deec86eabe@syzkaller.appspotmail.com
      Reported-by: syzbot+b5ea6fb6f139c8b9482b@syzkaller.appspotmail.com
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: https://lkml.kernel.org/r/20200820090341.GC5033@dhcp22.suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 05 Sep 2020, 4 commits
  5. 04 Sep 2020, 2 commits
    • memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC · 4533d3ae
      Authored by Roger Pau Monne
      This is in preparation for the logic behind MEMORY_DEVICE_DEVDAX
      also being used by non-DAX devices.
      
      No functional change intended.
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: https://lore.kernel.org/r/20200901083326.21264-3-roger.pau@citrix.com
      Signed-off-by: Juergen Gross <jgross@suse.com>
    • mm: fix pin vs. gup mismatch with gate pages · 9fa2dd94
      Authored by Dave Hansen
      Gate pages were missed when converting from get to pin_user_pages().
      This can lead to refcount imbalances.  This is reliably and quickly
      reproducible running the x86 selftests when vsyscall=emulate is enabled
      (the default).  Fix by using try_grab_page() with appropriate flags
      passed.
      
      The long story:
      
      Today, pin_user_pages() and get_user_pages() are similar interfaces for
      manipulating page reference counts.  However, "pins" use a "bias" value
      and manipulate the actual reference count by 1024 instead of the 1
      used by
      plain "gets".
      
      That means that pin_user_pages() must be matched with unpin_user_pages()
      and can't be mixed with a plain put_user_pages() or put_page().
      
      Enter gate pages, like the vsyscall page.  They are pages usually in the
      kernel image, but which are mapped to userspace.  Userspace is allowed
      access to them, including interfaces using get/pin_user_pages().  The
      refcount of these kernel pages is manipulated just like a normal user
      page on the get/pin side so that the put/unpin side can work the same
      for normal user pages or gate pages.
      
      get_gate_page() uses try_get_page() which only bumps the refcount by
      1, not 1024, even if called in the pin_user_pages() path.  If someone
      pins a gate page, this happens:
      
      	pin_user_pages()
      		get_gate_page()
      			try_get_page() // bump refcount +1
      	... some time later
      	unpin_user_pages()
      		page_ref_sub_and_test(page, 1024))
      
      ... and boom, we get a refcount off by 1023.  This is reliably and
      quickly reproducible running the x86 selftests when booted with
      vsyscall=emulate (the default).  The selftests use ptrace(), but I
      suspect anything using pin_user_pages() on gate pages could hit this.
      
      To fix it, simply use try_grab_page() instead of try_get_page(), and
      pass 'gup_flags' in so that FOLL_PIN can be respected.
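
      The fix, approximately (in get_gate_page(); try_grab_page() honors
      FOLL_PIN in gup_flags, unlike try_get_page()):

        if (unlikely(!try_grab_page(*page, gup_flags))) {
                ret = -ENOMEM;
                goto unmap;
        }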
      
      This bug traces back to the very beginning of the FOLL_PIN support in
      commit 3faa52c0 ("mm/gup: track FOLL_PIN pages"), which showed up in
      the 5.7 release.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Fixes: 3faa52c0 ("mm/gup: track FOLL_PIN pages")
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: x86@kernel.org
      Cc: Jann Horn <jannh@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 24 Aug 2020, 1 commit
  7. 22 Aug 2020, 6 commits
  8. 19 Aug 2020, 1 commit
  9. 15 Aug 2020, 9 commits
    • mm/swap.c: annotate data races for lru_rotate_pvecs · 7e0cc01e
      Authored by Qian Cai
      A read of lru_add_pvec->nr could be interrupted and then a write
      made to the same variable.  The write has local interrupts disabled,
      but the plain reads result in data races.  However, it is unlikely
      the compiler could do much damage here given that lru_add_pvec->nr
      is an "unsigned char" and there is an existing compiler barrier.
      Thus, annotate the reads using the data_race() macro.  The data
      races were reported by KCSAN,
      
       BUG: KCSAN: data-race in lru_add_drain_cpu / rotate_reclaimable_page
      
       write to 0xffff9291ebcb8a40 of 1 bytes by interrupt on cpu 23:
        rotate_reclaimable_page+0x2df/0x490
        pagevec_add at include/linux/pagevec.h:81
        (inlined by) rotate_reclaimable_page at mm/swap.c:259
        end_page_writeback+0x1b5/0x2b0
        end_swap_bio_write+0x1d0/0x280
        bio_endio+0x297/0x560
        dec_pending+0x218/0x430 [dm_mod]
        clone_endio+0xe4/0x2c0 [dm_mod]
        bio_endio+0x297/0x560
        blk_update_request+0x201/0x920
        scsi_end_request+0x6b/0x4a0
        scsi_io_completion+0xb7/0x7e0
        scsi_finish_command+0x1ed/0x2a0
        scsi_softirq_done+0x1c9/0x1d0
        blk_done_softirq+0x181/0x1d0
        __do_softirq+0xd9/0x57c
        irq_exit+0xa2/0xc0
        do_IRQ+0x8b/0x190
        ret_from_intr+0x0/0x42
        delay_tsc+0x46/0x80
        __const_udelay+0x3c/0x40
        __udelay+0x10/0x20
        kcsan_setup_watchpoint+0x202/0x3a0
        __tsan_read1+0xc2/0x100
        lru_add_drain_cpu+0xb8/0x3f0
        lru_add_drain+0x25/0x40
        shrink_active_list+0xe1/0xc80
        shrink_lruvec+0x766/0xb70
        shrink_node+0x2d6/0xca0
        do_try_to_free_pages+0x1f7/0x9a0
        try_to_free_pages+0x252/0x5b0
        __alloc_pages_slowpath+0x458/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       read to 0xffff9291ebcb8a40 of 1 bytes by task 37761 on cpu 23:
        lru_add_drain_cpu+0xb8/0x3f0
        lru_add_drain_cpu at mm/swap.c:602
        lru_add_drain+0x25/0x40
        shrink_active_list+0xe1/0xc80
        shrink_lruvec+0x766/0xb70
        shrink_node+0x2d6/0xca0
        do_try_to_free_pages+0x1f7/0x9a0
        try_to_free_pages+0x252/0x5b0
        __alloc_pages_slowpath+0x458/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       2 locks held by oom02/37761:
        #0: ffff9281e5928808 (&mm->mmap_sem#2){++++}, at: do_page_fault
        #1: ffffffffb3ade380 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part
       irq event stamp: 1949217
       trace_hardirqs_on_thunk+0x1a/0x1c
       __do_softirq+0x2e7/0x57c
       __do_softirq+0x34c/0x57c
       irq_exit+0xa2/0xc0
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 23 PID: 37761 Comm: oom02 Not tainted 5.6.0-rc3-next-20200226+ #6
       Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Marco Elver <elver@google.com>
      Link: http://lkml.kernel.org/r/20200228044018.1263-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/rmap: annotate a data race at tlb_flush_batched · 9c1177b6
      Authored by Qian Cai
      mm->tlb_flush_batched could be accessed concurrently as noticed by
      KCSAN,
      
       BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
      
       write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6:
        try_to_unmap_one+0x59a/0x1ab0
        set_tlb_ubc_flush_pending at mm/rmap.c:635
        (inlined by) try_to_unmap_one at mm/rmap.c:1538
        rmap_walk_anon+0x296/0x650
        rmap_walk+0xdf/0x100
        try_to_unmap+0x18a/0x2f0
        shrink_page_list+0xef6/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        balance_pgdat+0x652/0xd90
        kswapd+0x396/0x8d0
        kthread+0x1e0/0x200
        ret_from_fork+0x27/0x50
      
       read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4:
        flush_tlb_batched_pending+0x29/0x90
        flush_tlb_batched_pending at mm/rmap.c:682
        change_p4d_range+0x5dd/0x1030
        change_pte_range at mm/mprotect.c:44
        (inlined by) change_pmd_range at mm/mprotect.c:212
        (inlined by) change_pud_range at mm/mprotect.c:240
        (inlined by) change_p4d_range at mm/mprotect.c:260
        change_protection+0x222/0x310
        change_prot_numa+0x3e/0x60
        task_numa_work+0x219/0x350
        task_work_run+0xed/0x140
        prepare_exit_to_usermode+0x2cc/0x2e0
        ret_from_intr+0x32/0x42
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 4 PID: 6364 Comm: mtest01 Tainted: G        W    L 5.5.0-next-20200210+ #5
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      The read in flush_tlb_batched_pending() is under the PTL but the
      write is not; however, mm->tlb_flush_batched is only a bool, so the
      value is unlikely to be torn.  Thus, mark it as an intentional data
      race by using the data_race() macro.
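
      A compilable userspace model of the annotations used in this and
      the neighboring fixes (simplified; the real definitions live in
      include/linux/compiler.h and carry KCSAN instrumentation):

        /* Force a single, untorn access through a volatile lvalue. */
        #define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
        #define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

        /* In the kernel, data_race() additionally tells KCSAN that the
         * race is intentional; functionally it evaluates the expression. */
        #define data_race(expr)  (expr)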
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempool: fix a data race in mempool_free() · abe1de42
      Authored by Qian Cai
      mempool_t pool.curr_nr could be accessed concurrently as noticed by
      KCSAN,
      
       BUG: KCSAN: data-race in mempool_free / remove_element
      
       write to 0xffffffffa937638c of 4 bytes by task 6359 on cpu 113:
        remove_element+0x4a/0x1c0
        remove_element at mm/mempool.c:132
        mempool_alloc+0x102/0x210
        (inlined by) mempool_alloc at mm/mempool.c:399
        bio_alloc_bioset+0x106/0x2c0
        get_swap_bio+0x49/0x230
        __swap_writepage+0x680/0xc30
        swap_writepage+0x9c/0xf0
        pageout+0x33e/0xae0
        shrink_page_list+0x1f57/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
        <snip>
      
       read to 0xffffffffa937638c of 4 bytes by interrupt on cpu 64:
        mempool_free+0x3e/0x150
        mempool_free at mm/mempool.c:492
        bio_free+0x192/0x280
        bio_put+0x91/0xd0
        end_swap_bio_write+0x1d8/0x280
        bio_endio+0x2c2/0x5b0
        dec_pending+0x22b/0x440 [dm_mod]
        clone_endio+0xe4/0x2c0 [dm_mod]
        bio_endio+0x2c2/0x5b0
        blk_update_request+0x217/0x940
        scsi_end_request+0x6b/0x4d0
        scsi_io_completion+0xb7/0x7e0
        scsi_finish_command+0x223/0x310
        scsi_softirq_done+0x1d5/0x210
        blk_mq_complete_request+0x224/0x250
        scsi_mq_done+0xc2/0x250
        pqi_raid_io_complete+0x5a/0x70 [smartpqi]
        pqi_irq_handler+0x150/0x1410 [smartpqi]
        __handle_irq_event_percpu+0x90/0x540
        handle_irq_event_percpu+0x49/0xd0
        handle_irq_event+0x85/0xca
        handle_edge_irq+0x13f/0x3e0
        do_IRQ+0x86/0x190
        <snip>
      
      The write is under pool->lock, but the read is done locklessly.
      Even though commit 5b990546 ("mempool: fix and document
      synchronization and memory barrier usage") introduced an smp_wmb()
      and smp_rmb() pair to improve the situation, it is inadequate to
      protect the read from data races which could lead to a logic bug,
      so fix it by adding READ_ONCE() for the read.
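
      The annotated read, approximately (in mempool_free(); the
      corresponding writes remain under pool->lock):

        if (unlikely(READ_ONCE(pool->curr_nr) < pool->min_nr)) {
                spin_lock_irqsave(&pool->lock, flags);
                if (likely(pool->curr_nr < pool->min_nr)) {
                        add_element(pool, element);
                        spin_unlock_irqrestore(&pool->lock, flags);
                        wake_up(&pool->wait);
                        return;
                }
                spin_unlock_irqrestore(&pool->lock, flags);
        }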
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Link: http://lkml.kernel.org/r/1581446384-2131-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/list_lru: fix a data race in list_lru_count_one · a1f45935
      Authored by Qian Cai
      struct list_lru_one l.nr_items could be accessed concurrently as noticed
      by KCSAN,
      
       BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move
      
       write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
        list_lru_isolate_move+0xf9/0x130
        list_lru_isolate_move at mm/list_lru.c:180
        inode_lru_isolate+0x12b/0x2a0
        __list_lru_walk_one+0x122/0x3d0
        list_lru_walk_one+0x75/0xa0
        prune_icache_sb+0x8b/0xc0
        super_cache_scan+0x1b8/0x250
        do_shrink_slab+0x256/0x6d0
        shrink_slab+0x41b/0x4a0
        shrink_node+0x35c/0xd80
        balance_pgdat+0x652/0xd90
        kswapd+0x396/0x8d0
        kthread+0x1e0/0x200
        ret_from_fork+0x27/0x50
      
       read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
        list_lru_count_one+0x116/0x2f0
        list_lru_count_one at mm/list_lru.c:193
        super_cache_count+0xe8/0x170
        do_shrink_slab+0x95/0x6d0
        shrink_slab+0x41b/0x4a0
        shrink_node+0x35c/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x170/0x700
        __handle_mm_fault+0xc9f/0xd00
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      A torn read of l.nr_items could affect the shrinker behaviour due to
      a data race.  Fix it by adding READ_ONCE() for the read.  Since the
      writes are aligned and up to word-size, assume they are safe from
      data races, to avoid the readability cost of writing
      WRITE_ONCE(var, var + val).
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_counter: fix various data races at memsw · 6e4bd50f
      Authored by Qian Cai
      Since commit 3e32cb2e ("mm: memcontrol: lockless page counters"),
      memcg->memsw->watermark and memcg->memsw->failcnt can be accessed
      concurrently, as reported by KCSAN,
      
       BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
      
       read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
        page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
        try_charge+0x131/0xd50 mm/memcontrol.c:2405
        __memcg_kmem_charge_memcg+0x58/0x140
        __memcg_kmem_charge+0xcc/0x280
        __alloc_pages_nodemask+0x1e1/0x450
        alloc_pages_current+0xa6/0x120
        pte_alloc_one+0x17/0xd0
        __pte_alloc+0x3a/0x1f0
        copy_p4d_range+0xc36/0x1990
        copy_page_range+0x21d/0x360
        dup_mmap+0x5f5/0x7a0
        dup_mm+0xa2/0x240
        copy_process+0x1b3f/0x3460
        _do_fork+0xaa/0xa20
        __x64_sys_clone+0x13b/0x170
        do_syscall_64+0x91/0xb47
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
        page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
        try_charge+0x131/0xd50 mm/memcontrol.c:2405
        mem_cgroup_try_charge+0x159/0x460
        mem_cgroup_try_charge_delay+0x3d/0xa0
        wp_page_copy+0x14d/0x930
        do_wp_page+0x107/0x7b0
        __handle_mm_fault+0xce6/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
      
       write to 0xffff88809bbf2158 of 8 bytes by task 11782 on cpu 0:
        page_counter_try_charge+0x100/0x170 mm/page_counter.c:129
        try_charge+0x185/0xbf0 mm/memcontrol.c:2405
        __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
        __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
        __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
      
       read to 0xffff88809bbf2158 of 8 bytes by task 11814 on cpu 1:
        page_counter_try_charge+0xef/0x170 mm/page_counter.c:129
        try_charge+0x185/0xbf0 mm/memcontrol.c:2405
        __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
        __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
        __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
      
      Since the watermark could be compared to or set with garbage due to
      a data race that would change the code logic, fix it by adding a
      pair of READ_ONCE() and WRITE_ONCE() in those places.
      
      The "failcnt" counter is tolerant of some degree of inaccuracy and
      is only used to report stats, so a data race will not be harmful;
      mark it as an intentional data race using the data_race() macro.
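
      The annotated accesses, approximately (in page_counter.c):

        /* Watermark updates race with readers: use tear-free accesses. */
        if (new > READ_ONCE(c->watermark))
                WRITE_ONCE(c->watermark, new);

        /* failcnt only feeds stats, so the plain increment is merely
         * marked as an intentional race. */
        data_race(c->failcnt++);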
      
      Fixes: 3e32cb2e ("mm: memcontrol: lockless page counters")
      Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1581519682-23594-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapfile: fix and annotate various data races · a449bf58
      Authored by Qian Cai
      swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags
      could each be accessed concurrently, as noticed by KCSAN,
      
      === si.highest_bit ===
      
       write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
        swap_range_alloc+0x81/0x130
        swap_range_alloc at mm/swapfile.c:681
        scan_swap_map_slots+0x371/0xb90
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0xf2/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
        scan_swap_map_slots+0x4a6/0xb90
        scan_swap_map_slots at mm/swapfile.c:892
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0xf2/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 70 PID: 6672 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #3
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      === si.swap_map[offset] ===
      
       write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
        __swap_entry_free_locked+0x8c/0x100
        __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
        __swap_entry_free.constprop.20+0x69/0xb0
        free_swap_and_cache+0x53/0xa0
        unmap_page_range+0x7f8/0x1d70
        unmap_single_vma+0xcd/0x170
        unmap_vmas+0x18b/0x220
        exit_mmap+0xee/0x220
        mmput+0x10e/0x270
        do_exit+0x59b/0xf40
        do_group_exit+0x8b/0x180
      
       read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
        _swap_info_get+0x81/0xa0
        _swap_info_get at mm/swapfile.c:1140
        free_swap_and_cache+0x40/0xa0
        unmap_page_range+0x7f8/0x1d70
        unmap_single_vma+0xcd/0x170
        unmap_vmas+0x18b/0x220
        exit_mmap+0xee/0x220
        mmput+0x10e/0x270
        do_exit+0x59b/0xf40
        do_group_exit+0x8b/0x180
      
      === si.flags ===
      
       write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
        scan_swap_map_slots+0x6fe/0xb50
        scan_swap_map_slots at mm/swapfile.c:887
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0x377/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
        _swap_info_get+0x41/0xa0
        __swap_info_get at mm/swapfile.c:1114
        put_swap_page+0x84/0x490
        __remove_mapping+0x384/0x5f0
        shrink_page_list+0xff1/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
      The writes are under si->lock but the reads are not.  For
      si.highest_bit and si.swap_map[offset], a data race could trigger
      logic bugs, so fix them by using WRITE_ONCE() for the writes and
      READ_ONCE() for the reads, except those isolated reads that compare
      against zero, where a data race would cause no harm.  Annotate those
      as intentional data races using the data_race() macro.
      
      For si.flags, the readers are only interested in a single bit, so a
      data race there causes no issue.
      
      [cai@lca.pw: add a missing annotation for si->flags in memory.c]
        Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/filemap.c: fix a data race in filemap_fault() · e630bfac
      Authored by Kirill A. Shutemov
      struct file_ra_state ra.mmap_miss could be accessed concurrently during
      page faults as noticed by KCSAN,
      
       BUG: KCSAN: data-race in filemap_fault / filemap_map_pages
      
       write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
        filemap_fault+0x920/0xfc0
        do_sync_mmap_readahead at mm/filemap.c:2384
        (inlined by) filemap_fault at mm/filemap.c:2486
        __xfs_filemap_fault+0x112/0x3e0 [xfs]
        xfs_filemap_fault+0x74/0x90 [xfs]
        __do_fault+0x9e/0x220
        do_fault+0x4a0/0x920
        __handle_mm_fault+0xc69/0xd00
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
        filemap_map_pages+0xc2e/0xd80
        filemap_map_pages at mm/filemap.c:2625
        do_fault+0x3da/0x920
        __handle_mm_fault+0xc69/0xd00
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G        W    L 5.5.0-next-20200210+ #1
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      ra.mmap_miss contributes to readahead decisions, so a data race
      could be undesirable.  Both the read and the write happen only under
      the non-exclusive mmap_sem; two concurrent writers could even
      underflow the counter.  Fix the underflow by writing to a local
      variable before committing the final store to ra.mmap_miss, given
      that a small inaccuracy of the counter is acceptable.
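
      The pattern, approximately (the decrement side in
      filemap_map_pages(): one load into a local, a bound check on the
      local, one committed store):

        unsigned int mmap_miss;

        /* Load once; concurrent updates may cost accuracy but can no
         * longer underflow the counter. */
        mmap_miss = READ_ONCE(ra->mmap_miss);
        if (mmap_miss)
                WRITE_ONCE(ra->mmap_miss, --mmap_miss);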
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Marco Elver <elver@google.com>
      Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap_state: mark various intentional data races · b96a3db2
      Authored by Qian Cai
      swap_cache_info.* could be accessed concurrently as noticed by
      KCSAN,
      
       BUG: KCSAN: data-race in lookup_swap_cache / lookup_swap_cache
      
       write to 0xffffffff85517318 of 8 bytes by task 94138 on cpu 101:
        lookup_swap_cache+0x12e/0x460
        lookup_swap_cache at mm/swap_state.c:322
        do_swap_page+0x112/0xeb0
        __handle_mm_fault+0xc7a/0xd00
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       read to 0xffffffff85517318 of 8 bytes by task 91655 on cpu 100:
        lookup_swap_cache+0x117/0x460
        lookup_swap_cache at mm/swap_state.c:322
        shmem_swapin_page+0xc7/0x9e0
        shmem_getpage_gfp+0x2ca/0x16c0
        shmem_fault+0xef/0x3c0
        __do_fault+0x9e/0x220
        do_fault+0x4a0/0x920
        __handle_mm_fault+0xc69/0xd00
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 100 PID: 91655 Comm: systemd-journal Tainted: G        W  O L 5.5.0-next-20200204+ #6
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
       write to 0xffffffff8d717308 of 8 bytes by task 11365 on cpu 87:
         __delete_from_swap_cache+0x681/0x8b0
         __delete_from_swap_cache at mm/swap_state.c:178
      
       read to 0xffffffff8d717308 of 8 bytes by task 11275 on cpu 53:
         __delete_from_swap_cache+0x66e/0x8b0
         __delete_from_swap_cache at mm/swap_state.c:178
      
      Both the reads and writes are done locklessly.  Since
      swap_cache_info.* are only used to print out counter information,
      even if any of them miss a few increments due to data races, it is
      harmless, so just mark them as intentional data races using the
      data_race() macro.
      
      While at it, fix a checkpatch.pl warning,
      
      WARNING: Single statement macros should not use a do {} while (0) loop
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Link: http://lkml.kernel.org/r/20200207003715.1578-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_io: mark various intentional data races · 7b37e226
      Authored by Qian Cai
      struct swap_info_struct si.flags could be accessed concurrently as noticed
      by KCSAN,
      
       BUG: KCSAN: data-race in scan_swap_map_slots / swap_readpage
      
       write to 0xffff9c77b80ac400 of 8 bytes by task 91325 on cpu 16:
        scan_swap_map_slots+0x6fe/0xb50
        scan_swap_map_slots at mm/swapfile.c:887
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0x377/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1740/0x2820
        shrink_inactive_list+0x316/0x8b0
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x170/0x700
        __handle_mm_fault+0xc9f/0xd00
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       read to 0xffff9c77b80ac400 of 8 bytes by task 5422 on cpu 7:
        swap_readpage+0x204/0x6a0
        swap_readpage at mm/page_io.c:380
        read_swap_cache_async+0xa2/0xb0
        swapin_readahead+0x6a0/0x890
        do_swap_page+0x465/0xeb0
        __handle_mm_fault+0xc7a/0xd00
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 7 PID: 5422 Comm: gmain Tainted: G        W  O L 5.5.0-next-20200204+ #6
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      Other reads,
      
       read to 0xffff91ea33eac400 of 8 bytes by task 11276 on cpu 120:
        __swap_writepage+0x140/0xc20
        __swap_writepage at mm/page_io.c:289
      
       read to 0xffff91ea33eac400 of 8 bytes by task 11264 on cpu 16:
        swap_set_page_dirty+0x44/0x1f4
        swap_set_page_dirty at mm/page_io.c:442
      
      The write is under &si->lock, but the reads are done locklessly.
      Since the reads only check for a specific bit in the flags, load
      tearing is harmless.  Thus, just mark them as intentional data races
      using the data_race() macro.
      
      [cai@lca.pw: add a missing annotation]
        Link: http://lkml.kernel.org/r/1581612585-5812-1-git-send-email-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Link: http://lkml.kernel.org/r/20200207003601.1526-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>