1. 08 12月, 2022 1 次提交
    • C
      swapfile: fix soft lockup in scan_swap_map_slots · b2b6c3df
      Chen Wandun 提交于
      mainline inclusion
      from mainline-v6.1-rc7
      commit de1ccfb6
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I645DG
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de1ccfb648243a031cfbdc2d5571dfdaf5023106
      
      --------------------------------
      
      A softlockup occurs in scan free swap slot under huge memory pressure.
      The test scenario is: 64 CPU cores, 64GB memory, and 28 zram devices, the
      disksize of each zram device is 50MB.
      
      LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(), but
      the real loop number would more than LATENCY_LIMIT because of "goto checks
      and goto scan" repeatly without decreasing latency limit.
      
      In order to fix it, decrease latency_ration in advance.
      
      There is also a suspicious place that will cause softlockups in
      get_swap_pages().  In this function, the "goto start_over" may result in
      continuous scanning of the swap partition.  If there is no cond_sched in
      scan_swap_map_slots(), it would cause a softlockup (I am not sure about
      this).
      
      WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466]
      CPU: 11 PID: 466 Comm: kswapd@ Kdump: loaded Tainted: G
      dump backtrace+0x0/0x1le4
      show stack+0x20/@x2c
      dump_stack+0xd8/0x140
      watchdog print_info+0x48/0x54
      watchdog_process_before_softlockup+0x98/0xa0
      watchdog_timer_fn+0xlac/0x2d0
      hrtimer_rum_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handLe_percpu_devid_irq+0x90/0x1f4
      handle domain irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      e11 ira+0xhB/Bx140
      scan_swap_map_slots+0x678/0x890
      get_swap_pages+0x29c/0x440
      get_swap_page+0x120/0x2e0
      add_to_swap+UX2U/0XyC
      shrink_page_list+0x5d0/0x152c
      shrink_inactive_list+0xl6c/Bx500
      shrink_lruvec+0x270/0x304
      
      WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915]
      watchdog_timer_fn+0x1ac/0x2d0
      __run_hrtimer+0x98/0x2a0
      __hrtimer_run_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handle_percpu_devid_irq+0x90/0x1f4
      __handle_domain_irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      el1_irq+0xb8/0x140
      get_swap_pages+0x1e8/0x440
      get_swap_page+0x1c8/0x2e0
      add_to_swap+0x20/0x9c
      shrink_page_list+0x5d0/0x152c
      reclaim_pages+0x160/0x310
      madvise_cold_or_pageout_pte_range+0x7bc/0xe3c
      walk_pmd_range.isra.0+0xac/0x22c
      walk_pud_range+0xfc/0x1c0
      walk_pgd_range+0x158/0x1b0
      __walk_page_range+0x64/0x100
      walk_page_range+0x104/0x150
      
      Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com
      Fixes: 048c27fd ("[PATCH] swap: scan_swap_map latency breaks")
      Signed-off-by: NChen Wandun <chenwandun@huawei.com>
      Reviewed-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: <xialonglong1@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Conflicts:
      	mm/swapfile.c
      Signed-off-by: NChen Wandun <chenwandun@huawei.com>
      Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      b2b6c3df
  2. 11 11月, 2022 1 次提交
  3. 28 6月, 2022 1 次提交
  4. 06 7月, 2021 1 次提交
  5. 09 4月, 2021 1 次提交
  6. 08 2月, 2021 1 次提交
  7. 07 12月, 2020 1 次提交
  8. 14 10月, 2020 4 次提交
  9. 27 9月, 2020 1 次提交
    • G
      mm, THP, swap: fix allocating cluster for swapfile by mistake · 41663430
      Gao Xiang 提交于
      SWP_FS is used to make swap_{read,write}page() go through the
      filesystem, and it's only used for swap files over NFS.  So, !SWP_FS
      means non NFS for now, it could be either file backed or device backed.
      Something similar goes with legacy SWP_FILE.
      
      So in order to achieve the goal of the original patch, SWP_BLKDEV should
      be used instead.
      
      FS corruption can be observed with SSD device + XFS + fragmented
      swapfile due to CONFIG_THP_SWAP=y.
      
      I reproduced the issue with the following details:
      
      Environment:
      
        QEMU + upstream kernel + buildroot + NVMe (2 GB)
      
      Kernel config:
      
        CONFIG_BLK_DEV_NVME=y
        CONFIG_THP_SWAP=y
      
      Some reproducible steps:
      
        mkfs.xfs -f /dev/nvme0n1
        mkdir /tmp/mnt
        mount /dev/nvme0n1 /tmp/mnt
        bs="32k"
        sz="1024m"    # doesn't matter too much, I also tried 16m
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
      
        mkswap /tmp/mnt/sw
        swapon /tmp/mnt/sw
      
        stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
      
      Symptoms:
       - FS corruption (e.g. checksum failure)
       - memory corruption at: 0xd2808010
       - segfault
      
      Fixes: f0eea189 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
      Fixes: 38d8b4e6 ("mm, THP, swap: delay splitting THP during swap out")
      Signed-off-by: NGao Xiang <hsiangkao@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NYang Shi <shy828301@gmail.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Eric Sandeen <esandeen@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41663430
  10. 25 9月, 2020 2 次提交
  11. 24 9月, 2020 2 次提交
  12. 04 9月, 2020 1 次提交
    • S
      mm: Add arch hooks for saving/restoring tags · 8a84802e
      Steven Price 提交于
      Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
      every physical page, when swapping pages out to disk it is necessary to
      save these tags, and later restore them when reading the pages back.
      
      Add some hooks along with dummy implementations to enable the
      arch code to handle this.
      
      Three new hooks are added to the swap code:
       * arch_prepare_to_swap() and
       * arch_swap_invalidate_page() / arch_swap_invalidate_area().
      One new hook is added to shmem:
       * arch_swap_restore()
      Signed-off-by: NSteven Price <steven.price@arm.com>
      [catalin.marinas@arm.com: add unlock_page() on the error path]
      [catalin.marinas@arm.com: dropped the _tags suffix]
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      8a84802e
  13. 15 8月, 2020 2 次提交
    • Q
      mm/swapfile: fix and annotate various data races · a449bf58
      Qian Cai 提交于
      swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags could
      be accessed concurrently separately as noticed by KCSAN,
      
      === si.highest_bit ===
      
       write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
        swap_range_alloc+0x81/0x130
        swap_range_alloc at mm/swapfile.c:681
        scan_swap_map_slots+0x371/0xb90
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0xf2/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
        scan_swap_map_slots+0x4a6/0xb90
        scan_swap_map_slots at mm/swapfile.c:892
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0xf2/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 70 PID: 6672 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #3
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      === si.swap_map[offset] ===
      
       write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
        __swap_entry_free_locked+0x8c/0x100
        __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
        __swap_entry_free.constprop.20+0x69/0xb0
        free_swap_and_cache+0x53/0xa0
        unmap_page_range+0x7f8/0x1d70
        unmap_single_vma+0xcd/0x170
        unmap_vmas+0x18b/0x220
        exit_mmap+0xee/0x220
        mmput+0x10e/0x270
        do_exit+0x59b/0xf40
        do_group_exit+0x8b/0x180
      
       read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
        _swap_info_get+0x81/0xa0
        _swap_info_get at mm/swapfile.c:1140
        free_swap_and_cache+0x40/0xa0
        unmap_page_range+0x7f8/0x1d70
        unmap_single_vma+0xcd/0x170
        unmap_vmas+0x18b/0x220
        exit_mmap+0xee/0x220
        mmput+0x10e/0x270
        do_exit+0x59b/0xf40
        do_group_exit+0x8b/0x180
      
      === si.flags ===
      
       write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
        scan_swap_map_slots+0x6fe/0xb50
        scan_swap_map_slots at mm/swapfile.c:887
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0x377/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
        _swap_info_get+0x41/0xa0
        __swap_info_get at mm/swapfile.c:1114
        put_swap_page+0x84/0x490
        __remove_mapping+0x384/0x5f0
        shrink_page_list+0xff1/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
      The writes are under si->lock but the reads are not. For si.highest_bit
      and si.swap_map[offset], data race could trigger logic bugs, so fix them
      by having WRITE_ONCE() for the writes and READ_ONCE() for the reads
      except those isolated reads where they compare against zero which a data
      race would cause no harm. Thus, annotate them as intentional data races
      using the data_race() macro.
      
      For si.flags, the readers are only interested in a single bit where a
      data race there would cause no issue there.
      
      [cai@lca.pw: add a missing annotation for si->flags in memory.c]
        Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pwSigned-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pwSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a449bf58
    • M
      mm: replace hpage_nr_pages with thp_nr_pages · 6c357848
      Matthew Wilcox (Oracle) 提交于
      The thp prefix is more frequently used than hpage and we should be
      consistent between the various functions.
      
      [akpm@linux-foundation.org: fix mm/migrate.c]
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
      Reviewed-by: NZi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c357848
  14. 13 8月, 2020 2 次提交
  15. 01 7月, 2020 1 次提交
  16. 10 6月, 2020 2 次提交
    • M
      mmap locking API: use coccinelle to convert mmap_sem rwsem call sites · d8ed45c5
      Michel Lespinasse 提交于
      This change converts the existing mmap_sem rwsem calls to use the new mmap
      locking API instead.
      
      The change is generated using coccinelle with the following rule:
      
      // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
      
      @@
      expression mm;
      @@
      (
      -init_rwsem
      +mmap_init_lock
      |
      -down_write
      +mmap_write_lock
      |
      -down_write_killable
      +mmap_write_lock_killable
      |
      -down_write_trylock
      +mmap_write_trylock
      |
      -up_write
      +mmap_write_unlock
      |
      -downgrade_write
      +mmap_write_downgrade
      |
      -down_read
      +mmap_read_lock
      |
      -down_read_killable
      +mmap_read_lock_killable
      |
      -down_read_trylock
      +mmap_read_trylock
      |
      -up_read
      +mmap_read_unlock
      )
      -(&mm->mmap_sem)
      +(mm)
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Liam Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ying Han <yinghan@google.com>
      Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8ed45c5
    • M
      mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4
      Mike Rapoport 提交于
      Patch series "mm: consolidate definitions of page table accessors", v2.
      
      The low level page table accessors (pXY_index(), pXY_offset()) are
      duplicated across all architectures and sometimes more than once.  For
      instance, we have 31 definition of pgd_offset() for 25 supported
      architectures.
      
      Most of these definitions are actually identical and typically it boils
      down to, e.g.
      
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      
      These definitions can be shared among 90% of the arches provided
      XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
      
      For architectures that really need a custom version there is always
      possibility to override the generic version with the usual ifdefs magic.
      
      These patches introduce include/linux/pgtable.h that replaces
      include/asm-generic/pgtable.h and add the definitions of the page table
      accessors to the new header.
      
      This patch (of 12):
      
      The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
      functions involving page table manipulations, e.g.  pte_alloc() and
      pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
      in the files that include <linux/mm.h>.
      
      The include statements in such cases are remove with a simple loop:
      
      	for f in $(git grep -l "include <linux/mm.h>") ; do
      		sed -i -e '/include <asm\/pgtable.h>/ d' $f
      	done
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e31cf2f4
  17. 04 6月, 2020 5 次提交
  18. 03 6月, 2020 11 次提交