1. 03 Mar, 2021 (1 commit)
  2. 25 Feb, 2021 (1 commit)
  3. 10 Feb, 2021 (1 commit)
  4. 28 Jan, 2021 (2 commits)
  5. 21 Jan, 2021 (1 commit)
  6. 16 Dec, 2020 (4 commits)
  7. 07 Dec, 2020 (1 commit)
  8. 14 Oct, 2020 (4 commits)
  9. 27 Sep, 2020 (1 commit)
    • mm, THP, swap: fix allocating cluster for swapfile by mistake · 41663430
      Committed by Gao Xiang
      SWP_FS is used to make swap_{read,write}page() go through the
      filesystem, and it is only used for swap files over NFS.  So !SWP_FS
      means non-NFS for now; it could be either file-backed or device-backed.
      Something similar applies to the legacy SWP_FILE flag.
      
      So, in order to achieve the goal of the original patch, SWP_BLKDEV
      should be used instead.
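      A hedged sketch of the corrected guard on the huge-cluster allocation
      path (the function and variable names are assumed from mainline
      mm/swapfile.c and may differ from the actual patch):
      
        /* huge cluster allocation must be limited to real block devices,
         * not merely "not swap-over-NFS" (!SWP_FS) */
        if (size == SWAPFILE_CLUSTER) {
                if (si->flags & SWP_BLKDEV)
                        n_ret = swap_alloc_cluster(si, swp_entries);
        } else {
                n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
                                            n_goal, swp_entries);
        }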
      
      FS corruption can be observed with SSD device + XFS + fragmented
      swapfile due to CONFIG_THP_SWAP=y.
      
      I reproduced the issue with the following details:
      
      Environment:
      
        QEMU + upstream kernel + buildroot + NVMe (2 GB)
      
      Kernel config:
      
        CONFIG_BLK_DEV_NVME=y
        CONFIG_THP_SWAP=y
      
      Some reproducible steps:
      
        mkfs.xfs -f /dev/nvme0n1
        mkdir /tmp/mnt
        mount /dev/nvme0n1 /tmp/mnt
        bs="32k"
        sz="1024m"    # doesn't matter too much, I also tried 16m
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
      
        mkswap /tmp/mnt/sw
        swapon /tmp/mnt/sw
      
        stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
      
      Symptoms:
       - FS corruption (e.g. checksum failure)
       - memory corruption at: 0xd2808010
       - segfault
      
      Fixes: f0eea189 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
      Fixes: 38d8b4e6 ("mm, THP, swap: delay splitting THP during swap out")
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Eric Sandeen <esandeen@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41663430
  10. 25 Sep, 2020 (2 commits)
  11. 24 Sep, 2020 (2 commits)
  12. 04 Sep, 2020 (1 commit)
    • mm: Add arch hooks for saving/restoring tags · 8a84802e
      Committed by Steven Price
      Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
      every physical page.  When swapping pages out to disk it is necessary
      to save these tags, and later restore them when the pages are read back.
      
      Add some hooks, along with dummy implementations, to enable the arch
      code to handle this.
      
      Three new hooks are added to the swap code:
       * arch_prepare_to_swap() and
       * arch_swap_invalidate_page() / arch_swap_invalidate_area().
      One new hook is added to shmem:
       * arch_swap_restore()
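      A hedged sketch of what the generic no-op fallbacks for these hooks
      might look like, following the usual __HAVE_ARCH_* override convention
      (the guard names and placement here are assumptions):
      
        #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
        static inline int arch_prepare_to_swap(struct page *page)
        {
                return 0;       /* nothing to save by default */
        }
        #endif
        
        #ifndef __HAVE_ARCH_SWAP_INVALIDATE
        static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
        {
        }
        
        static inline void arch_swap_invalidate_area(int type)
        {
        }
        #endif
        
        #ifndef __HAVE_ARCH_SWAP_RESTORE
        static inline void arch_swap_restore(swp_entry_t entry, struct page *page)
        {
        }
        #endif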
      Signed-off-by: Steven Price <steven.price@arm.com>
      [catalin.marinas@arm.com: add unlock_page() on the error path]
      [catalin.marinas@arm.com: dropped the _tags suffix]
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      8a84802e
  13. 15 Aug, 2020 (2 commits)
    • mm/swapfile: fix and annotate various data races · a449bf58
      Committed by Qian Cai
      The swap_info_struct fields si.highest_bit, si.swap_map[offset] and
      si.flags can each be accessed concurrently without synchronization, as
      noticed by KCSAN:
      
      === si.highest_bit ===
      
       write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
        swap_range_alloc+0x81/0x130
        swap_range_alloc at mm/swapfile.c:681
        scan_swap_map_slots+0x371/0xb90
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0xf2/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
        scan_swap_map_slots+0x4a6/0xb90
        scan_swap_map_slots at mm/swapfile.c:892
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0xf2/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 70 PID: 6672 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #3
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      === si.swap_map[offset] ===
      
       write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
        __swap_entry_free_locked+0x8c/0x100
        __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
        __swap_entry_free.constprop.20+0x69/0xb0
        free_swap_and_cache+0x53/0xa0
        unmap_page_range+0x7f8/0x1d70
        unmap_single_vma+0xcd/0x170
        unmap_vmas+0x18b/0x220
        exit_mmap+0xee/0x220
        mmput+0x10e/0x270
        do_exit+0x59b/0xf40
        do_group_exit+0x8b/0x180
      
       read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
        _swap_info_get+0x81/0xa0
        _swap_info_get at mm/swapfile.c:1140
        free_swap_and_cache+0x40/0xa0
        unmap_page_range+0x7f8/0x1d70
        unmap_single_vma+0xcd/0x170
        unmap_vmas+0x18b/0x220
        exit_mmap+0xee/0x220
        mmput+0x10e/0x270
        do_exit+0x59b/0xf40
        do_group_exit+0x8b/0x180
      
      === si.flags ===
      
       write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
        scan_swap_map_slots+0x6fe/0xb50
        scan_swap_map_slots at mm/swapfile.c:887
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0x377/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
        _swap_info_get+0x41/0xa0
        __swap_info_get at mm/swapfile.c:1114
        put_swap_page+0x84/0x490
        __remove_mapping+0x384/0x5f0
        shrink_page_list+0xff1/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
      The writes are under si->lock but the reads are not.  For si.highest_bit
      and si.swap_map[offset], the data races could trigger logic bugs, so fix
      them by using WRITE_ONCE() for the writes and READ_ONCE() for the reads,
      except for those isolated reads that only compare against zero, where a
      data race causes no harm.  Annotate those as intentional data races
      using the data_race() macro.
      
      For si.flags, the readers are only interested in a single bit, so a
      data race there causes no issue.
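      A hedged illustration of the annotation pattern described above (the
      exact call sites in mm/swapfile.c are abbreviated and may differ):
      
        /* writer, under si->lock */
        WRITE_ONCE(si->highest_bit, offset - 1);
        
        /* lockless reader whose result feeds into logic */
        if (offset > READ_ONCE(si->highest_bit))
                goto scan;
        
        /* lockless read that only compares against zero: mark the race
         * as intentional rather than forcing READ_ONCE() */
        if (data_race(si->swap_map[offset]))
                continue;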
      
      [cai@lca.pw: add a missing annotation for si->flags in memory.c]
        Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a449bf58
    • mm: replace hpage_nr_pages with thp_nr_pages · 6c357848
      Committed by Matthew Wilcox (Oracle)
      The thp prefix is more frequently used than hpage and we should be
      consistent between the various functions.
      
      [akpm@linux-foundation.org: fix mm/migrate.c]
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c357848
  14. 13 Aug, 2020 (2 commits)
  15. 01 Jul, 2020 (1 commit)
  16. 10 Jun, 2020 (2 commits)
    • mmap locking API: use coccinelle to convert mmap_sem rwsem call sites · d8ed45c5
      Committed by Michel Lespinasse
      This change converts the existing mmap_sem rwsem calls to use the new mmap
      locking API instead.
      
      The change is generated using coccinelle with the following rule:
      
      // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
      
      @@
      expression mm;
      @@
      (
      -init_rwsem
      +mmap_init_lock
      |
      -down_write
      +mmap_write_lock
      |
      -down_write_killable
      +mmap_write_lock_killable
      |
      -down_write_trylock
      +mmap_write_trylock
      |
      -up_write
      +mmap_write_unlock
      |
      -downgrade_write
      +mmap_write_downgrade
      |
      -down_read
      +mmap_read_lock
      |
      -down_read_killable
      +mmap_read_lock_killable
      |
      -down_read_trylock
      +mmap_read_trylock
      |
      -up_read
      +mmap_read_unlock
      )
      -(&mm->mmap_sem)
      +(mm)
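      For illustration, the rule turns call sites like the following around
      (a hedged before/after sketch, not taken from any particular file):
      
        /* before */
        down_read(&mm->mmap_sem);
        vma = find_vma(mm, addr);
        /* ... */
        up_read(&mm->mmap_sem);
        
        /* after */
        mmap_read_lock(mm);
        vma = find_vma(mm, addr);
        /* ... */
        mmap_read_unlock(mm);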
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Liam Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ying Han <yinghan@google.com>
      Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8ed45c5
    • mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4
      Committed by Mike Rapoport
      Patch series "mm: consolidate definitions of page table accessors", v2.
      
      The low level page table accessors (pXY_index(), pXY_offset()) are
      duplicated across all architectures and sometimes more than once.  For
      instance, we have 31 definitions of pgd_offset() for 25 supported
      architectures.
      
      Most of these definitions are actually identical and typically boil
      down to, e.g.
      
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      
      These definitions can be shared among 90% of the arches provided
      XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
      
      For architectures that really need a custom version there is always the
      possibility to override the generic version with the usual ifdefs magic.
      
      These patches introduce include/linux/pgtable.h that replaces
      include/asm-generic/pgtable.h and add the definitions of the page table
      accessors to the new header.
      
      This patch (of 12):
      
      The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
      functions involving page table manipulations, e.g.  pte_alloc() and
      pmd_alloc().  So there is no point in explicitly including <asm/pgtable.h>
      in files that already include <linux/mm.h>.
      
      The include statements in such cases are removed with a simple loop:
      
      	for f in $(git grep -l "include <linux/mm.h>") ; do
      		sed -i -e '/include <asm\/pgtable.h>/ d' $f
      	done
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e31cf2f4
  17. 04 Jun, 2020 (5 commits)
  18. 03 Jun, 2020 (7 commits)
    • mm: swapfile: fix /proc/swaps heading and Size/Used/Priority alignment · 6f793940
      Committed by Randy Dunlap
      Fix the heading and Size/Used/Priority field alignments in /proc/swaps.
      If the Size and/or Used value is >= 10000000 (8 bytes), then the
      alignment by using tab characters is broken.
      
      This patch maintains the use of tabs for alignment.  If spaces are
      preferred, we can just use a Field Width specifier for the bytes and
      inuse fields.  That way those fields don't have to be a multiple of 8
      bytes in width.  E.g., with a field width of 12, both Size and Used
      would always fit on the first line of an 80-column wide terminal (only
      Priority would be on the second line).
      
      There are actually 2 problems: heading alignment and field width.  On an
      xterm, if Used is 7 bytes in length, the tab does nothing, and the
      display is like this, with no space/tab between the Used and Priority
      fields.  (ugh)
      
      Filename				Type		Size	Used	Priority
      /dev/sda8                               partition	16779260	2023012-1
      
      To be clear, if one does 'cat /proc/swaps >/tmp/proc.swaps', it does look
      different, like so:
      
      Filename				Type		Size	Used	Priority
      /dev/sda8                               partition	16779260	2086988	-1
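      A hedged sketch of the space-padded alternative mentioned above (the
      patch itself keeps tabs; the field widths and variable names here are
      illustrative only, not the actual /proc/swaps code):
      
        /* pad with spaces so Size and Used stay aligned no matter how
         * many digits they have */
        seq_printf(swap, "%-40s %-12s %12u %12u %d\n",
                   filename, "partition", size_kb, used_kb, priority);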
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Link: http://lkml.kernel.org/r/c0ffb41a-81ac-ddfa-d452-a9229ecc0387@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f793940
    • swap: reduce lock contention on swap cache from swap slots allocation · 49070588
      Committed by Huang Ying
      In some swap scalability tests, heavy lock contention on the swap cache
      is observed even though the swap cache radix tree has been split from
      one tree per swap device into one tree per 64 MB trunk in commit
      4b3ef9da ("mm/swap: split swap cache into 64MB trunks").
      
      The reason is as follows.  After the swap device becomes fragmented so
      that there is no free swap cluster, the swap device is scanned linearly
      to find free swap slots.  swap_info_struct->cluster_next is the next
      scanning base shared by all CPUs, so nearby free swap slots will be
      allocated to different CPUs.  The probability that multiple CPUs operate
      on the same 64 MB trunk is high, which causes the lock contention on the
      swap cache.
      
      To solve the issue, this patch adds a per-CPU next scanning base
      (cluster_next_cpu) for SSD swap devices.  Every CPU uses its own
      per-CPU scanning base, and after finishing a 64MB trunk the per-CPU
      base is moved to the beginning of another randomly selected 64MB trunk.
      In this way, the probability that multiple CPUs operate on the same
      64 MB trunk is greatly reduced, and so is the lock contention.  For HDD,
      because sequential access is more important for IO performance, the
      original shared next scanning base is kept.
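      A hedged sketch of the scan-base selection with the new per-CPU field
      (names follow the description above; the real logic in
      scan_swap_map_slots() handles more cases):
      
        /* SSDs use a per-CPU scan base so that CPUs spread out across
         * different 64MB trunks; rotational devices keep the shared
         * cluster_next to preserve sequential IO */
        if (si->flags & SWP_SOLIDSTATE)
                scan_base = this_cpu_read(*si->cluster_next_cpu);
        else
                scan_base = si->cluster_next;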
      
      To test the patch, we ran a 16-process pmbench memory benchmark on a
      2-socket server machine with 48 cores.  One ram disk per socket is
      configured as a swap device.  The pmbench working-set size is much
      larger than the available memory so that swapping is triggered.  The
      memory read/write ratio is 80/20 and the access pattern is random.  In
      the original implementation, the lock contention on the swap cache is
      heavy.  The perf profiling data for the lock contention code paths is
      as follows:
      
       _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      7.91
       _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:               7.11
       _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
       _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     1.66
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      1.29
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.03
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        0.93
      
      After applying this patch, it becomes:
      
       _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      2.3
       _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     2.26
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        1.8
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.19
      
      The lock contention on the swap cache is almost eliminated.
      
      And the pmbench score increases by 18.5%.  The swapin throughput
      increases by 18.7% from 2.96 GB/s to 3.51 GB/s, while the swapout
      throughput increases by 18.5% from 2.99 GB/s to 3.54 GB/s.
      
      We need a really fast disk to show the benefit.  I have tried this on 2
      Intel P3600 NVMe disks.  The performance improvement is only about 1%.
      The improvement should be better on faster disks, such as an Intel
      Optane disk.
      
      [ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel]
        Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com
      [ying.huang@intel.com: v4]
        Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49070588
    • mm/swapfile.c: use prandom_u32_max() · 09fe06ce
      Committed by Huang Ying
      Use prandom_u32_max() to improve code readability and take advantage of
      the common implementation.
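      A hedged sketch of such a conversion (the exact call site and variable
      names in mm/swapfile.c may differ):
      
        /* before: open-coded bounded pseudo-random value */
        p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
        
        /* after: use the common helper */
        p->cluster_next = 1 + prandom_u32_max(p->highest_bit);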
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200512081013.520201-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09fe06ce
    • 33e16272
    • swap: try to scan more free slots even when fragmented · ed43af10
      Committed by Huang Ying
      Currently, the scalability of the swap code drops sharply when the swap
      device becomes fragmented, because swap slot allocation batching stops
      working.  To solve the problem, this patch tries to scan a few more swap
      slots, with restricted effort, so that swap slot allocation can still be
      batched even when the swap device is fragmented.  Tests show that the
      benchmark score can increase by up to 37.1% with the patch.  Details are
      as follows.
      
      The swap code has a per-cpu cache of swap slots.  These batch swap space
      allocations to improve swap subsystem scaling.  In the following code
      path,
      
        add_to_swap()
          get_swap_page()
            refill_swap_slots_cache()
              get_swap_pages()
      	  scan_swap_map_slots()
      
      scan_swap_map_slots() and get_swap_pages() can return multiple swap
      slots for each call.  These slots will be cached in the per-CPU swap
      slots cache, so that several following swap slot requests will be
      fulfilled there to avoid the lock contention in the lower level swap
      space allocation/freeing code path.
      
      But this only works when there are free swap clusters.  If a swap device
      becomes so fragmented that there are no free swap clusters,
      scan_swap_map_slots() and get_swap_pages() will return only one swap
      slot for each call in the above code path.  Effectively, this falls back
      to the situation before the swap slots cache was introduced, where the
      heavy contention on the swap-related locks kills the scalability.
      
      Why does it work this way?  Because the swap device can be large and
      free swap slot scanning can be quite time consuming, a conservative
      method was used to avoid spending too much time scanning for free swap
      slots.
      
      In fact, this can be improved by scanning a few more free slots with
      strictly restricted effort, which is what this patch implements.  In
      scan_swap_map_slots(), after the first free swap slot is found, we try
      to scan a little further, but only if we have not already scanned too
      many slots (< LATENCY_LIMIT).  That is, the added scanning latency is
      strictly bounded.
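      A hedged sketch of the idea, not the exact patch; the helper below is a
      hypothetical stand-in for the real scanning loop inside
      scan_swap_map_slots():
      
        /* after the first free slot is found, keep collecting slots, but
         * give up once LATENCY_LIMIT slots have been examined */
        while (n_ret < n_goal && scanned < LATENCY_LIMIT) {
                offset = find_next_free_slot(si, offset, &scanned); /* hypothetical */
                if (!offset)
                        break;
                slots[n_ret++] = swp_entry(si->type, offset);
        }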
      
      To test the patch, we ran a 16-process pmbench memory benchmark on a
      2-socket server machine with 48 cores.  Multiple ram disks are
      configured as the swap devices.  The pmbench working-set size is much
      larger than the available memory so that swapping is triggered.  The
      memory read/write ratio is 80/20 and the access pattern is random, so
      the swap space becomes highly fragmented during the test.  In the
      original implementation, the contention on the swap-related locks is
      very heavy.  The perf profiling data for the lock contention code paths
      is as follows:
      
       _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap:             21.03
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    1.92
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      1.72
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       0.69
      
      After applying this patch, it becomes:
      
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    4.89
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      3.85
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       1.1
       _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88
      
      That is, the lock contention on the swap locks is eliminated.
      
      And the pmbench score increases by 37.1%.  The swapin throughput
      increases by 45.7% from 2.02 GB/s to 2.94 GB/s, while the swapout
      throughput increases by 45.3% from 2.04 GB/s to 2.97 GB/s.
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ed43af10
    • mm/swapfile.c: omit a duplicate code by compare tmp and max first · 7b9e2de1
      Committed by Wei Yang
      There are two duplicated code paths handling the case where no swap
      entry is available.  To avoid this, we can compare tmp and max first and
      let the second guard do its job.
      
      No functional change is expected.
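      A hedged sketch of the restructured flow after this cleanup (the
      function context, presumably the SSD cluster scanning helper, and its
      locals are assumed from mainline swapfile.c and may differ):
      
        /* scan inside the current cluster only if it is not exhausted */
        if (tmp < max) {
                ci = lock_cluster(si, tmp);
                while (tmp < max) {
                        if (!si->swap_map[tmp])
                                break;
                        tmp++;
                }
                unlock_cluster(ci);
        }
        /* a single guard now covers both "cluster exhausted" paths */
        if (tmp >= max) {
                cluster_set_null(&cluster->index);
                goto new_cluster;
        }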
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200421213824.8099-3-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b9e2de1
    • mm/swapfile.c: tmp is always smaller than max · fdff1deb
      Committed by Wei Yang
      If tmp is greater than or equal to max, we jump to new_cluster, so at
      this point tmp is always smaller than max and we can return true
      directly.
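      A hedged sketch of the resulting tail of the same scanning helper
      (building on the sketch above; the exact code may differ):
      
        /* tmp < max is guaranteed here, so no further check is needed */
        cluster->next = tmp + 1;
        *offset = tmp;
        *scan_base = tmp;
        return true;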
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200421213824.8099-2-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdff1deb