1. 06 Mar, 2019 (40 commits)
    • mm, compaction: shrink compact_control · c5fbd937
      Authored by Mel Gorman
      Patch series "Increase success rates and reduce latency of compaction", v3.
      
      This series reduces scan rates and increases success rates of compaction,
      primarily by using the free lists to shorten scans, better controlling
      skip information and whether multiple scanners can target the same
      block, and capturing pageblocks before they are stolen by parallel requests.
      The series is based on mmotm from January 9th, 2019 with the previous
      compaction series reverted.
      
      I'm mostly using thpscale to measure the impact of the series.  The
      benchmark creates a large file, maps it, faults it, punches holes in the
      mapping so that the virtual address space is fragmented and then tries
      to allocate THP.  It re-executes for different numbers of threads.  From
      a fragmentation perspective, the workload is relatively benign but it
      does stress compaction.
      
      The overall impact on latencies for a 1-socket machine is
      
      				      baseline		      patches
      Amean     fault-both-3      3832.09 (   0.00%)     2748.56 *  28.28%*
      Amean     fault-both-5      4933.06 (   0.00%)     4255.52 (  13.73%)
      Amean     fault-both-7      7017.75 (   0.00%)     6586.93 (   6.14%)
      Amean     fault-both-12    11610.51 (   0.00%)     9162.34 *  21.09%*
      Amean     fault-both-18    17055.85 (   0.00%)    11530.06 *  32.40%*
      Amean     fault-both-24    19306.27 (   0.00%)    17956.13 (   6.99%)
      Amean     fault-both-30    22516.49 (   0.00%)    15686.47 *  30.33%*
      Amean     fault-both-32    23442.93 (   0.00%)    16564.83 *  29.34%*
      
      The allocation success rates are much improved
      
      			 	 baseline		 patches
      Percentage huge-3        85.99 (   0.00%)       97.96 (  13.92%)
      Percentage huge-5        88.27 (   0.00%)       96.87 (   9.74%)
      Percentage huge-7        85.87 (   0.00%)       94.53 (  10.09%)
      Percentage huge-12       82.38 (   0.00%)       98.44 (  19.49%)
      Percentage huge-18       83.29 (   0.00%)       99.14 (  19.04%)
      Percentage huge-24       81.41 (   0.00%)       97.35 (  19.57%)
      Percentage huge-30       80.98 (   0.00%)       98.05 (  21.08%)
      Percentage huge-32       80.53 (   0.00%)       97.06 (  20.53%)
      
      That's a nearly perfect allocation success rate.
      
      The biggest impact is on the scan rates
      
      Compaction migrate scanned    55893379    19341254
      Compaction free scanned      474739990    11903963
      
      The number of pages scanned for migration was reduced by 65% and the
      free scanner was reduced by 97.5%.  So much less work in exchange for
      lower latency and better success rates.
      
      The series was also evaluated using a workload that heavily fragments
      memory but the benefits there are also significant, albeit not
      presented.
      
      It was commented that we should be rethinking scanning entirely and to a
      large extent I agree.  However, to achieve that you need a lot of this
      series in place first, so it's best to make the linear scanners as good
      as possible before ripping them out.
      
      This patch (of 22):
      
      The isolate and migrate scanners should never isolate more than a
      pageblock of pages, so unsigned int is sufficient, saving 8 bytes on a
      64-bit build.
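
      A minimal sketch of the kind of change described, with illustrative field
      names only (the real struct in mm/internal.h has many more members):

        /* Sketch only: counters bounded by pages-per-pageblock fit in 32 bits. */
        struct compact_control {
                struct list_head freepages;     /* free pages to migrate to */
                struct list_head migratepages;  /* pages being migrated */
                unsigned int nr_freepages;      /* was unsigned long */
                unsigned int nr_migratepages;   /* was unsigned long */
                /* ... remaining members unchanged ... */
        };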
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5fbd937
    • mm/filemap: pass inclusive 'end_byte' parameter to filemap_range_has_page · 35f12f0f
      Authored by zhengbin
      The 'end_byte' parameter of filemap_range_has_page is required to be
      inclusive, so follow the rule.
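
      A minimal sketch of the calling convention this fixes, using hypothetical
      pos/count variables rather than the actual call site:

        /* end_byte is the last byte of the range, inclusive, so subtract one. */
        if (filemap_range_has_page(mapping, pos, pos + count - 1)) {
                /* page cache pages exist in the range; handle the fallback */
        }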
      
      Link: http://lkml.kernel.org/r/1548678679-18122-1-git-send-email-zhengbin13@huawei.com
      Fixes: 6be96d3a ("fs: return if direct I/O will trigger writeback")
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hou Tao <houtao1@huawei.com>
      Cc: zhangyi (F) <yi.zhang@huawei.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35f12f0f
    • mm: shuffle GFP_* flags · d71e53ce
      Authored by Alexey Dobriyan
      GFP_KERNEL is one of the most used constants, but on architectures like arm
      with fixed-length instructions some constants are more equal than others.
      Constants with tightly packed bits can be encoded directly in the
      instruction stream:
      
      	   0:   e3a00d33        mov     r0, #3264       ; 0xcc0
      
      Others require multiple instructions, or even a load from outside the
      instruction stream:
      
      	   0:   e3a000c0        mov     r0, #192        ; 0xc0
      	   4:   e3400060        movt    r0, #96		; 0x60
      
      Shuffle GFP_* flags so that GFP_KERNEL/GFP_ATOMIC + __GFP_ZERO bits are
      close to each other.
      
      Savings on arm configs are ~0.1%.
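
      As an illustration of why the bit packing matters, here is a small
      standalone C sketch (not from the patch) that checks whether a 32-bit
      constant fits the classic ARM A32 modified-immediate encoding, i.e. an
      8-bit value rotated right by an even amount; the two constants come from
      the disassembly above:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* True if v can be encoded as imm8 rotated right by an even amount. */
        static bool fits_arm_immediate(uint32_t v)
        {
                for (int rot = 0; rot < 32; rot += 2) {
                        /* rotating left by rot undoes a right rotation by rot */
                        uint32_t r = (v << rot) | (v >> ((32 - rot) & 31));
                        if (r <= 0xff)
                                return true;
                }
                return false;
        }

        int main(void)
        {
                uint32_t packed = 0x00000cc0; /* mov r0, #3264: one instruction */
                uint32_t sparse = 0x006000c0; /* needs mov + movt: bits spread out */

                printf("0x%08x single-insn: %d\n", (unsigned)packed,
                       fits_arm_immediate(packed));
                printf("0x%08x single-insn: %d\n", (unsigned)sparse,
                       fits_arm_immediate(sparse));
                return 0;
        }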
      
      Link: http://lkml.kernel.org/r/20190109201838.GA9140@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d71e53ce
    • mm: swap: add comment for swap_vma_readahead · e9f59873
      Authored by Yang Shi
      swap_vma_readahead()'s comment is missing, just add it.
      
      Link: http://lkml.kernel.org/r/1546543673-108536-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9f59873
    • mm: swap: check if swap backing device is congested or not · 8fd2e0b5
      Authored by Yang Shi
      Swap readahead reads in a few pages regardless of whether the underlying
      device is busy.  It may incur a long wait if the device is congested, and
      it may also exacerbate the congestion.
      
      Use inode_read_congested() to check whether the underlying device is
      busy, as file page readahead does.  The inode is obtained from
      swap_info_struct.
      
      Although we could add the inode information to swap_address_space
      (address_space->host), it may lead to unexpected side effects, e.g. it
      may break mapping_cap_account_dirty().  Using the inode from
      swap_info_struct is simple and good enough.
      
      The check is only done in vma_cluster_readahead() since
      swap_vma_readahead() is only used for non-rotational devices, which are
      much less likely to be congested than a traditional HDD.
      
      Although swap slots may be consecutive on a swap partition, they may
      still be fragmented on a swap file.  This check helps reduce excessive
      stalls in that case.
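
      A hedged sketch of the check described above, assuming it sits in the
      cluster readahead path and that the swap_info_struct is available as si
      (the exact placement may differ):

        /* Sketch: skip readahead when the swap backing device is congested. */
        struct inode *inode = si->swap_file->f_mapping->host;

        if (inode_read_congested(inode))
                goto skip;      /* read only the faulting page, no readahead */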
      
      The test with page_fault1 from will-it-scale (tracing may just show
      runtest.py, which is the wrapper script for page_fault1), which
      basically launches NR_CPU threads each generating 128MB of anonymous
      pages, shows on my virtual machine with a congested HDD that long-tail
      latency is reduced significantly.
      
      Without the patch
       page_fault1_thr-1490  [023]   129.311706: funcgraph_entry:      #57377.796 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369103: funcgraph_entry:        5.642us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369119: funcgraph_entry:      #1289.592 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370411: funcgraph_entry:        4.957us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370419: funcgraph_entry:        1.940us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.378847: funcgraph_entry:      #1411.385 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380262: funcgraph_entry:        3.916us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380275: funcgraph_entry:      #4287.751 us |  do_swap_page();
      
      With the patch
            runtest.py-1417  [020]   301.925911: funcgraph_entry:      #9870.146 us |  do_swap_page();
            runtest.py-1417  [020]   301.935785: funcgraph_entry:        9.802us   |  do_swap_page();
            runtest.py-1417  [020]   301.935799: funcgraph_entry:        3.551us   |  do_swap_page();
            runtest.py-1417  [020]   301.935806: funcgraph_entry:        2.142us   |  do_swap_page();
            runtest.py-1417  [020]   301.935853: funcgraph_entry:        6.938us   |  do_swap_page();
            runtest.py-1417  [020]   301.935864: funcgraph_entry:        3.765us   |  do_swap_page();
            runtest.py-1417  [020]   301.935871: funcgraph_entry:        3.600us   |  do_swap_page();
            runtest.py-1417  [020]   301.935878: funcgraph_entry:        7.202us   |  do_swap_page();
      
      [akpm@linux-foundation.org: code cleanup]
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Tim Chen <tim.c.chen@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fd2e0b5
    • mm/filemap.c: remove redundant test from find_get_pages_contig · 14ef1fc7
      Authored by Matthew Wilcox
      After we establish a reference on the page, we check the pointer
      continues to be in the correct position in i_pages.  Checking
      page->index afterwards is unnecessary; if it were to change, then the
      pointer to it from the page cache would also move.  The check used to be
      done before grabbing a reference on the page which was racy (see commit
      9cbb4cb2 ("mm: find_get_pages_contig fixlet")), but nobody noticed
      that moving the check after grabbing the reference was redundant.
      
      Link: http://lkml.kernel.org/r/20190107200224.13260-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14ef1fc7
    • mm/memcontrol.c: use struct_size() in kmalloc() · 67b8046f
      Authored by Gustavo A. R. Silva
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array.  For example:
      
        struct foo {
            int stuff;
            void *entry[];
        };
      
        instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
        instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
      
      Link: http://lkml.kernel.org/r/20190104183726.GA6374@embeddedor
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67b8046f
    • mm: remove extra drain pages on pcp list · c52e7593
      Authored by Wei Yang
      In the current implementation, there are two places that isolate a range
      of pages: __offline_pages() and alloc_contig_range().  During this
      procedure, pages on the pcp lists are drained.
      
      Below is a brief call flow:
      
        __offline_pages()/alloc_contig_range()
            start_isolate_page_range()
                set_migratetype_isolate()
                    drain_all_pages()
            drain_all_pages()                 <--- A
      
      This snippet shows that the current logic isolates and drains the pcp
      lists for each pageblock, and then drains the pcp lists again for the
      whole range.
      
      start_isolate_page_range is responsible for isolating the given pfn
      range.  One part of that job is to make sure that pages that are on the
      allocator pcp lists are also properly isolated.  Otherwise they could be
      reused and the range wouldn't be completely isolated until the memory is
      freed back.  While there is no strict guarantee here because pages might
      get allocated at any time before drain_all_pages is called there doesn't
      seem to be any strong demand for such a guarantee.
      
      In any case, draining is already done at the isolation level and there
      is no need to do it again later by start_isolate_page_range callers
      (memory hotplug and CMA allocator currently).  Therefore remove
      pointless draining in existing callers to make the code more clear and
      functionally correct.
      
      [mhocko@suse.com: provide a clearer changelog for the last two paragraphs]
      Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c52e7593
    • arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages · 5480280d
      Authored by Anshuman Khandual
      Let arm64 subscribe to the previously added framework in which an
      architecture can inform whether a given huge page size is supported for
      migration.  This just overrides the default function
      arch_hugetlb_migration_supported() and enables migration for all
      possible HugeTLB page sizes on arm64.
      
      With this, HugeTLB migration support on arm64 now covers all possible
      HugeTLB options.
      
                CONT PTE    PMD    CONT PMD    PUD
                --------    ---    --------    ---
        4K:        64K      2M        32M      1G
        16K:        2M     32M         1G
        64K:        2M    512M        16G
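
      A hedged sketch of what the arm64 override looks like (the exact set of
      cases and any warning text are best checked against
      arch/arm64/mm/hugetlbpage.c):

        bool arch_hugetlb_migration_supported(struct hstate *h)
        {
                size_t pagesize = huge_page_size(h);

                switch (pagesize) {
        #ifdef CONFIG_ARM64_4K_PAGES
                case PUD_SIZE:          /* 1G with 4K base pages */
        #endif
                case PMD_SIZE:          /* 2M / 32M / 512M */
                case CONT_PMD_SIZE:     /* 32M / 1G / 16G  */
                case CONT_PTE_SIZE:     /* 64K / 2M / 2M   */
                        return true;
                }
                return false;           /* unrecognised huge page size */
        }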
      
      Link: http://lkml.kernel.org/r/1545121450-1663-6-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5480280d
    • arm64/mm: enable HugeTLB migration · 4a03a058
      Authored by Anshuman Khandual
      Let arm64 subscribe to the generic HugeTLB page migration framework.  Right
      now this only works on the following PMD and PUD level HugeTLB page
      sizes with various kernel base page size combinations.
      
               CONT PTE    PMD    CONT PMD    PUD
               --------    ---    --------    ---
        4K:         NA     2M         NA      1G
        16K:        NA    32M         NA
        64K:        NA   512M         NA
      
      Link: http://lkml.kernel.org/r/1545121450-1663-5-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a03a058
    • mm/hugetlb: enable arch specific huge page size support for migration · e693de18
      Authored by Anshuman Khandual
      Architectures like arm64 have HugeTLB page sizes which differ from the
      generic sizes at PMD, PUD and PGD level and are implemented via contiguous
      bits.  At present these special-size HugeTLB pages cannot be identified
      through macros like (PMD|PUD|PGDIR)_SHIFT and are hence chosen not to be
      migrated.
      
      Enabling migration support for these special HugeTLB page sizes along
      with the generic ones (PMD|PUD|PGD) would require identifying all of
      them on a given platform.  A platform specific hook can precisely
      enumerate all huge page sizes supported for migration.  Instead of
      comparing against standard huge page orders, let the
      hugepage_migration_supported() function call a platform hook,
      arch_hugetlb_migration_supported().  The default definition of the
      platform hook maintains the existing semantics of checking standard huge
      page orders.
      But an architecture can choose to override the default and provide
      support for a comprehensive set of huge page sizes.
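
      A hedged sketch of the default hook and its caller, assuming the helpers
      live in include/linux/hugetlb.h as described:

        #ifndef arch_hugetlb_migration_supported
        static inline bool arch_hugetlb_migration_supported(struct hstate *h)
        {
                /* Default semantics: only standard huge page orders migrate. */
                return huge_page_shift(h) == PMD_SHIFT ||
                       huge_page_shift(h) == PUD_SHIFT ||
                       huge_page_shift(h) == PGDIR_SHIFT;
        }
        #endif

        static inline bool hugepage_migration_supported(struct hstate *h)
        {
                return arch_hugetlb_migration_supported(h);
        }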
      
      Link: http://lkml.kernel.org/r/1545121450-1663-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e693de18
    • mm/hugetlb: enable PUD level huge page migration · 9b553bf5
      Authored by Anshuman Khandual
      Architectures like arm64 have PUD level HugeTLB pages for certain configs
      (a 1GB huge page is PUD based with the ARM64_4K_PAGES base page size) that
      can be enabled for migration.  This can be achieved by checking for
      PUD_SHIFT order HugeTLB pages during migration.
      
      Link: http://lkml.kernel.org/r/1545121450-1663-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b553bf5
    • mm/hugetlb: distinguish between migratability and movability · 7ed2c31d
      Authored by Anshuman Khandual
      Patch series "arm64/mm: Enable HugeTLB migration", v4.
      
      This patch series enables HugeTLB migration support for all supported
      huge page sizes at all levels including contiguous bit implementation.
      Following HugeTLB migration support matrix has been enabled with this
      patch series.  All permutations have been tested except for the 16GB.
      
                 CONT PTE    PMD    CONT PMD    PUD
                 --------    ---    --------    ---
        4K:         64K     2M         32M     1G
        16K:         2M    32M          1G
        64K:         2M   512M         16G
      
      First the series adds migration support for PUD based huge pages.  It
      then adds a platform specific hook to query an architecture if a given
      huge page size is supported for migration while also providing a default
      fallback option preserving the existing semantics which just checks for
      (PMD|PUD|PGDIR)_SHIFT macros.  The last two patches enable HugeTLB
      migration on arm64 and subscribe to this new platform specific hook by
      defining an override.
      
      The second patch differentiates between movability and migratability
      aspects of huge pages and implements hugepage_movable_supported() which
      can then be used during allocation to decide whether to place the huge
      page in movable zone or not.
      
      This patch (of 5):
      
      During huge page allocation its migratability is checked to determine
      if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
      But the movability aspect of the huge page could depend on factors other
      than just migratability.  Movability in itself is a distinct property
      which should not be tied with migratability alone.
      
      This differentiates these two and implements an enhanced movability check
      which also considers huge page size to determine if it is feasible to be
      placed under a movable zone.  At present it just checks for gigantic pages
      but going forward it can incorporate other enhanced checks.
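
      A hedged sketch of the check described (the actual helper in
      include/linux/hugetlb.h may differ in detail):

        static inline bool hugepage_movable_supported(struct hstate *h)
        {
                if (!hugepage_migration_supported(h))
                        return false;
                /* Gigantic pages are not considered movable for now. */
                if (hstate_is_gigantic(h))
                        return false;
                return true;
        }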
      
      Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ed2c31d
    • mm: remove sysctl_extfrag_handler() · 6b7e5cad
      Authored by Matthew Wilcox
      sysctl_extfrag_handler() neglects to propagate the return value from
      proc_dointvec_minmax() to its caller.  It's a wrapper that doesn't need
      to exist, so just use proc_dointvec_minmax() directly.
      
      Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reported-by: Aditya Pakki <pakki001@umn.edu>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b7e5cad
    • selftests/vm: add script helper for CONFIG_TEST_VMALLOC_MODULE · a05ef00c
      Authored by Uladzislau Rezki (Sony)
      Add the test script for the kernel test driver to analyse the vmalloc
      allocator for benchmarking and stressing purposes.  It is just a kernel
      module loader.  You can specify and pass different parameters in order
      to investigate allocation behaviour.  See the "usage" output for more
      details.
      
      Also add basic vmalloc smoke test to the "run_vmtests" suite.
      
      Link: http://lkml.kernel.org/r/20190103142108.20744-4-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Shuah Khan <shuah@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a05ef00c
    • vmalloc: add test driver to analyse vmalloc allocator · 3f21a6b7
      Authored by Uladzislau Rezki (Sony)
      This adds a new kernel module for analysis of the vmalloc allocator.  It
      can only be built as a module.  There are two main uses for this module:
      performance evaluation and stressing of the vmalloc subsystem.
      
      It consists of several test cases.  As of now there are 8.  The module
      has five parameters we can specify to change its behaviour.
      
      1) run_test_mask - set of tests to be run
      
      id: 1,   name: fix_size_alloc_test
      id: 2,   name: full_fit_alloc_test
      id: 4,   name: long_busy_list_alloc_test
      id: 8,   name: random_size_alloc_test
      id: 16,  name: fix_align_alloc_test
      id: 32,  name: random_size_align_alloc_test
      id: 64,  name: align_shift_alloc_test
      id: 128, name: pcpu_alloc_test
      
      By default all tests are in the run test mask.  If you want to select
      specific tests, pass the mask explicitly.  For example, to run the first,
      second and fourth tests the value is 11 (1 + 2 + 8).
      
      2) test_repeat_count - how many times each test should be repeated
      By default it is one time per test. It is possible to pass any number.
      The higher the value, the longer the test duration.
      
      3) test_loop_count - internal test loop counter. By default it is set
      to 1000000.
      
      4) single_cpu_test - use one CPU to run the tests
      By default this parameter is set to false. It means that all online
      CPUs execute tests. By setting it to 1, the tests are executed by
      first online CPU only.
      
      5) sequential_test_order - run tests in sequential order
      By default this parameter is set to false. It means that before running
      tests the order is shuffled. It is possible to make it sequential, just
      set it to 1.
      
      Performance analysis:
      In order to evaluate performance of vmalloc allocations, usually it
      makes sense to use only one CPU that runs tests, use sequential order,
      number of repeat tests can be different as well as set of test mask.
      
      For example if we want to run all tests, to use one CPU and repeat each
      test 3 times. Insert the module passing following parameters:
      
      single_cpu_test=1 sequential_test_order=1 test_repeat_count=3
      
      with following output:
      
      <snip>
      Summary: fix_size_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 901177 usec
      Summary: full_fit_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 1039341 usec
      Summary: long_busy_list_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 11775763 usec
      Summary: random_size_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 6081992 usec
      Summary: fix_align_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 2003712 usec
      Summary: random_size_align_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 2895689 usec
      Summary: align_shift_alloc_test passed: 0 failed: 3 repeat: 3 loops: 1000000 avg: 573 usec
      Summary: pcpu_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 95802 usec
      All test took CPU0=192945605995 cycles
      <snip>
      
      The align_shift_alloc_test is expected to fail.
      
      Stressing:
      In order to stress the vmalloc subsystem we run all available test cases
      on all available CPUs simultaneously. In order to prevent a constant
      behaviour pattern, the test case array is shuffled by default to randomize
      the order
      of test execution.
      
      For example, to run all tests (the default) on all online CPUs (the default)
      in shuffled order (the default), repeating each test 30 times, the command
      would be:
      
      modprobe vmalloc_test test_repeat_count=30
      
      The expected result is that the system stays alive, there are no BUG_ONs
      or kernel panics, the tests complete, and there are no memory leaks.
      
      [urezki@gmail.com: fix 32-bit builds]
        Link: http://lkml.kernel.org/r/20190106214839.ffvjvmrn52uqog7k@pc636
      [urezki@gmail.com: make CONFIG_TEST_VMALLOC depend on CONFIG_MMU]
        Link: http://lkml.kernel.org/r/20190219085441.s6bg2gpy4esny5vw@pc636
      Link: http://lkml.kernel.org/r/20190103142108.20744-3-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f21a6b7
    • vmalloc: export __vmalloc_node_range for CONFIG_TEST_VMALLOC_MODULE · 153178ed
      Authored by Uladzislau Rezki (Sony)
      Export the __vmalloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE
      is enabled.  Some test cases in the vmalloc test suite module require and
      make use of that function.  Please note that it is not supposed to be
      used for other purposes.
      
      We need it only for performance analysis, stressing and stability check
      of vmalloc allocator.
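
      A hedged sketch of the kind of change involved in mm/vmalloc.c (the exact
      guard may differ):

        /* Export only when the vmalloc test driver can be built. */
        #ifdef CONFIG_TEST_VMALLOC_MODULE
        EXPORT_SYMBOL_GPL(__vmalloc_node_range);
        #endif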
      
      Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      153178ed
    • mm/vmalloc: pass VM_USERMAP flags directly to __vmalloc_node_range() · bc84c535
      Authored by Roman Penyaev
      vmalloc_user*() calls differ from normal vmalloc() only in that they set
      the VM_USERMAP flag on the area.  After all the changes in vmalloc.c's
      history, it is now possible simply to pass the VM_USERMAP flag directly
      to the __vmalloc_node_range() call instead of looking up the area (which
      obviously takes time) after the allocation.
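
      A hedged sketch of what vmalloc_user() looks like after this change,
      assuming the __vmalloc_node_range() signature of that era:

        void *vmalloc_user(unsigned long size)
        {
                return __vmalloc_node_range(size, SHMLBA,
                                            VMALLOC_START, VMALLOC_END,
                                            GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
                                            VM_USERMAP, NUMA_NO_NODE,
                                            __builtin_return_address(0));
        }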
      
      Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc84c535
    • mm/vmalloc: do not call kmemleak_free() on not yet accounted memory · c67dc624
      Authored by Roman Penyaev
      __vmalloc_area_node() calls vfree() on the error path, which in turn calls
      kmemleak_free(), but the area is not yet accounted for by kmemleak_vmalloc().
      
      Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c67dc624
    • mm/vmalloc: fix size check for remap_vmalloc_range_partial() · 401592d2
      Authored by Roman Penyaev
      When VM_NO_GUARD is not set, area->size includes the adjacent guard page,
      so for correct size checking get_vm_area_size() should be used, not
      area->size.
      
      This fixes a possible kernel oops when userspace tries to mmap an area
      one page bigger than was allocated by the vmalloc_user() call: the size
      check inside remap_vmalloc_range_partial() also accounts for the
      non-existent guard page, so the check passes but vmalloc_to_page()
      returns NULL (the guard page does not physically exist).
      
      The following code pattern example should trigger an oops:
      
        static int oops_mmap(struct file *file, struct vm_area_struct *vma)
        {
              void *mem;
      
              mem = vmalloc_user(4096);
              BUG_ON(!mem);
              /* Do not care about mem leak */
      
              return remap_vmalloc_range(vma, mem, 0);
        }
      
      And userspace simply mmaps size + PAGE_SIZE:
      
        mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);
      
      Possible candidates for oops which do not have any explicit size
      checks:
      
         *** drivers/media/usb/stkwebcam/stk-webcam.c:
         v4l_stk_mmap[789]   ret = remap_vmalloc_range(vma, sbuf->buffer, 0);
      
      Or the following one:
      
         *** drivers/video/fbdev/core/fbmem.c
         static int
         fb_mmap(struct file *file, struct vm_area_struct * vma)
              ...
              res = fb->fb_mmap(info, vma);
      
      Where fb_mmap callback calls remap_vmalloc_range() directly without any
      explicit checks:
      
         *** drivers/video/fbdev/vfb.c
         static int vfb_mmap(struct fb_info *info,
                   struct vm_area_struct *vma)
         {
             return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff);
         }
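
      A hedged sketch of the corrected check inside remap_vmalloc_range_partial(),
      assuming the surrounding code of that era:

        /* Use the usable size (guard page excluded) rather than area->size. */
        if (kaddr + size > area->addr + get_vm_area_size(area))
                return -EINVAL;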
      
      Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      401592d2
    • mm/vmalloc.c: make vmalloc_32_user() align base kernel virtual address to SHMLBA · 5a82ac71
      Authored by Roman Penyaev
      This patch repeats the original one from David S Miller:
      
        2dca6999 ("mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA")
      
      but for the missed vmalloc_32_user() case, which also requires correct
      alignment of the virtual address on the kernel side to avoid D-cache
      aliases.  A bit of copy-paste from the original patch as a reminder of
      what this is all about:
      
        When a vmalloc'd area is mmap'd into userspace, some kind of
        co-ordination is necessary for this to work on platforms with cpu
        D-caches which can have aliases.
      
        Otherwise kernel side writes won't be seen properly in userspace and
        vice versa.
      
        If the kernel side mapping and the user side one have the same
        alignment, modulo SHMLBA, this can work as long as the VMA is VM_SHARED,
        and for all current users this is true. VM_SHARED will force SHMLBA
        alignment of the user side mmap on platforms where D-cache aliasing
        matters.
      
        David S. Miller
      
      > What are the user-visible runtime effects of this change?
      
      In simple words: proper alignment avoids a possible difference in data
      seen through different virtual mappings: userspace and kernel in our case,
      i.e. userspace reads cache line A, the kernel writes to cache line B, and
      both cache lines correspond to the same physical memory (thus aliases).
      
      So this should fix data corruption for archs with VIVT and VIPT caches,
      e.g. armv6.  Personally I've never worked with these archs; I just
      spotted the strange difference in the code: in one case we do the
      alignment, in the other we don't.  I have a strong feeling that David
      simply missed the vmalloc_32_user() case.
      
      >
      > Is a -stable backport needed?
      
      No, I do not think so.  The only user of vmalloc_32_user() is the
      virtual frame buffer device drivers/video/fbdev/vfb.c, whose description
      says "The main use of this frame buffer device is testing and debugging
      the frame buffer subsystem.  Do NOT enable it for normal systems!".
      
      And it seems to me that vfb.c does not need 32-bit addressable pages
      (the vmalloc_32_user() case), because it is a virtual device and should
      not care about things like dma32 zones, etc.  It would probably be better
      to clean up the code, switch vfb.c from vmalloc_32_user() to
      vmalloc_user(), and wipe out vmalloc_32_user() from vmalloc.c completely.
      But I'm not very sure that this is worth doing; it's so minor that we can
      leave it as is.
      
      Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a82ac71
    • memcg: localize memcg_kmem_enabled() check · 60cd4bcd
      Authored by Shakeel Butt
      Move the memcg_kmem_enabled() checks into the memcg kmem charge/uncharge
      functions so that users don't have to explicitly check that condition.
      
      This is purely a code cleanup patch without any functional change.  Only
      the order of checks in memcg_charge_slab() can potentially change, but
      functionally it will be the same.  This should not matter as
      memcg_charge_slab() is not in the hot path.
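
      A hedged sketch of the resulting wrapper shape in
      include/linux/memcontrol.h, assuming the double-underscore naming for the
      out-of-line helpers:

        static __always_inline int memcg_kmem_charge(struct page *page,
                                                     gfp_t gfp, int order)
        {
                if (memcg_kmem_enabled())
                        return __memcg_kmem_charge(page, gfp, order);
                return 0;
        }

        static __always_inline void memcg_kmem_uncharge(struct page *page,
                                                        int order)
        {
                if (memcg_kmem_enabled())
                        __memcg_kmem_uncharge(page, order);
        }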
      
      Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60cd4bcd
    • mm, slub: make the comment of put_cpu_partial() complete · 9234bae9
      Authored by Wei Yang
      There are two cases when put_cpu_partial() is invoked.
      
          * __slab_free
          * get_partial_node
      
      This patch just makes the comment cover both cases.
      
      Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9234bae9
    • mm: reuse only-pte-mapped KSM page in do_wp_page() · 52d1e606
      Authored by Kirill Tkhai
      Add an optimization for KSM pages almost in the same way that we have
      for ordinary anonymous pages.  If there is a write fault in a page that
      is mapped by only a single pte and is not related to the swap cache, the
      page may be reused without copying its content.
      
      [ Note that we do not consider PageSwapCache() pages at least for now,
        since we don't want to complicate __get_ksm_page(), which has nice
        optimization based on this (for the migration case). Currently it is
        spinning on PageSwapCache() pages, waiting for their counters to be
        unfrozen (i.e., for the migration to finish). But we don't want to make
        it also spin on the swap cache pages we try to reuse, since the
        probability of reusing them is not very high. So, for now we do not
        consider PageSwapCache() pages at all. ]
      
      So in reuse_ksm_page() we check 1) PageSwapCache() and 2)
      page_stable_node() to skip a page which KSM is currently trying to link
      to the stable tree.  Then we do page_ref_freeze() to prohibit KSM from
      merging one more page into the page we are reusing.  After that, nobody
      can refer to the page being reused: KSM skips !PageSwapCache() pages with
      zero refcount; and the protection against all other participants is the
      same as for reused ordinary anon pages: pte lock, page lock and
      mmap_sem.
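
      A hedged sketch of the helper described above (the real mm/ksm.c version
      may differ in detail, e.g. in its debugging checks):

        bool reuse_ksm_page(struct page *page, struct vm_area_struct *vma,
                            unsigned long address)
        {
                /* Skip swap cache pages and pages KSM is linking to the tree. */
                if (PageSwapCache(page) || !page_stable_node(page))
                        return false;
                /* Prohibit KSM from merging anything else into this page. */
                if (!page_ref_freeze(page, 1))
                        return false;

                page_move_anon_rmap(page, vma);
                page->index = linear_page_index(vma, address);
                page_ref_unfreeze(page, 1);

                return true;
        }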
      
      [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
      Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52d1e606
    • tools/: replace open encodings for NUMA_NO_NODE · 7c9eefe8
      Authored by Stephen Rothwell
      This replaces all open encodings in tools with NUMA_NO_NODE.  Also
      linux/numa.h is now needed for the perf build.
      
      [sfr@canb.auug.org.au: fix for replace open encodings for NUMA_NO_NODE]
        Link: http://lkml.kernel.org/r/20190108131141.730e9c4f@canb.auug.org.au
      Link: http://lkml.kernel.org/r/1545127933-10711-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Cc: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Cc: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c9eefe8
    • mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Authored by Anshuman Khandual
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where an invalid node number is
      encoded as -1.  Even though implicitly understood, it is always better to
      have a macro there.  Replace these open encodings of an invalid node
      number with the global macro NUMA_NO_NODE.  This helps remove NUMA
      related assumptions like 'invalid node' from various places, redirecting
      them to a common definition.
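
      A minimal illustrative before/after of the pattern being replaced
      (hypothetical call site, not one of the actual conversions):

        #include <linux/numa.h>         /* NUMA_NO_NODE is defined as -1 */

        /* before */
        if (nid == -1)
                nid = numa_node_id();

        /* after */
        if (nid == NUMA_NO_NODE)
                nid = numa_node_id();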
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98fa15f3
    • mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap() · 6ade2032
      Authored by Liviu Dudau
      find_vmap_area() can return a NULL pointer and we're going to
      dereference it without checking it first.  Use the existing
      find_vm_area() function which does exactly what we want and checks for
      the NULL pointer.
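
      A hedged sketch of the shape of the fix in __vunmap() (the actual warning
      text in mm/vmalloc.c may differ):

        struct vm_struct *area = find_vm_area(addr);

        if (unlikely(!area)) {
                WARN(1, KERN_ERR "Trying to vfree() nonexistent vm area (%p)\n",
                     addr);
                return;
        }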
      
      Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk
      Fixes: f3c01d2f ("mm: vmalloc: avoid racy handling of debugobjects in vunmap")
      Signed-off-by: Liviu Dudau <liviu@dudau.co.uk>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ade2032
    • PM/Hibernate: exclude all PageOffline() pages · abd02ac6
      Authored by David Hildenbrand
      The content of pages that are marked PG_offline (e.g. inflated by a
      balloon driver) is not of interest, so let's skip these pages.
      
      In saveable_highmem_page(), move the PageReserved() check into a new check
      along with the PageOffline() check to separate it from the swsusp
      checks.
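
      A hedged sketch of the check in kernel/power/snapshot.c's
      saveable_highmem_page() after the change (the exact structure may differ):

        /* Offline pages (e.g. balloon-inflated) have stale content: skip them. */
        if (PageReserved(page) || PageOffline(page))
                return NULL;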
      
      [david@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20181122100627.5189-9-david@redhat.com
      Link: http://lkml.kernel.org/r/20181119101616.8901-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abd02ac6
    • PM/Hibernate: use pfn_to_online_page() · 5b56db37
      Authored by David Hildenbrand
      Let's use pfn_to_online_page() instead of pfn_to_page() when checking
      for saveable pages to not save/restore offline memory sections.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b56db37
    • vmw_balloon: mark inflated pages PG_offline · 8165540c
      Authored by David Hildenbrand
      Mark inflated and never onlined pages PG_offline, to tell the world that
      the content is stale and should not be dumped.
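
      A hedged sketch of the pattern used by the balloon drivers in this series,
      relying on the page-type helpers added earlier in the series:

        /* Page handed to the hypervisor: its content is no longer meaningful. */
        __SetPageOffline(page);

        /* ... page sits on the balloon list while inflated ... */

        /* Page deflated and about to go back to the page allocator. */
        __ClearPageOffline(page);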
      
      [david@redhat.com: use vmballoon_page_in_frames more widely]
        Link: http://lkml.kernel.org/r/20181122100627.5189-7-david@redhat.com
      Link: http://lkml.kernel.org/r/20181119101616.8901-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Nadav Amit <namit@vmware.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8165540c
    • hv_balloon: mark inflated pages PG_offline · fae42c4d
      Authored by David Hildenbrand
      Mark inflated and never onlined pages PG_offline, to tell the world that
      the content is stale and should not be dumped.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Pankaj gupta <pagupta@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fae42c4d
    • D
      xen/balloon: mark inflated pages PG_offline · 77c4adf6
      Committed by David Hildenbrand
      Mark inflated and never onlined pages PG_offline, to tell the world that
      the content is stale and should not be dumped.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77c4adf6
    • D
      kexec: export PG_offline to VMCOREINFO · e04b742f
      Committed by David Hildenbrand
      Right now, pages inflated as part of a balloon driver will be dumped by
      dump tools like makedumpfile.  While XEN is able to check in the crash
      kernel whether a certain pfn is actually backed by memory in the
      hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
      other balloon inflated memory will essentially result in zero pages
      getting allocated by the hypervisor and the dump getting filled with
      this data.
      
      The allocation and reading of zero pages can directly be avoided if a
      dumping tool could know which pages only contain stale information not
      to be dumped.
      
      We now have PG_offline which can be (and already is by virtio-balloon)
      used for marking pages as logically offline.  Follow up patches will
      make use of this flag also in other balloon implementations.
      
      Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so makedumpfile
      can directly skip pages that are logically offline and whose content is
      therefore stale.
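
      As a rough sketch of what such an export looks like (the placement and
      the (~PG_offline) encoding are assumptions here, mirroring how the
      buddy marker is exported; only PAGE_OFFLINE_MAPCOUNT_VALUE itself is
      named by this patch):

          /* Sketch: make the magic page-type value visible to dump tools
           * through the VMCOREINFO note. */
          #define PAGE_OFFLINE_MAPCOUNT_VALUE     (~PG_offline)

          VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);

      A dump tool can then compare a page's page-type/_mapcount field
      against this number and exclude the page, much as free (buddy) pages
      are already excluded.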
      
      Please note that this is also helpful for a problem we were seeing under
      Hyper-V: Dumping logically offline memory (pages kept fake offline while
      onlining a section via online_page_callback) would under some conditions
      result in a kernel panic when dumping them.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Dave Young <dyoung@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e04b742f
    • D
      mm: convert PG_balloon to PG_offline · ca215086
      Committed by David Hildenbrand
      PG_balloon was introduced to implement page migration/compaction for
      pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
      page is part of virtio-balloon and therefore logically offline.
      
      We also want to make use of this flag in other balloon drivers - for
      inflated pages or when onlining a section but keeping some pages offline
      (e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).
      
      We are going to expose this flag to dump tools like makedumpfile.  But
      instead of exposing PG_balloon, let's generalize the concept of marking
      pages as logically offline, so it can be reused for other purposes later
      on.
      
      Rename PG_balloon to PG_offline.  This is an indicator that the page is
      logically offline, the content stale and that it should not be touched
      (e.g.  a hypervisor would have to allocate backing storage in order for
      the guest to dump an unused page).  We can then e.g.  exclude such pages
      from dumps.
      
      We replace and reuse KPF_BALLOON (23), as this shouldn't really do any harm
      (and for now the semantics stay the same).  In following patches, we
      will make use of this bit also in other balloon drivers.  While at it,
      document PGTABLE.
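
      A sketch of what reusing bit 23 amounts to for /proc/kpageflags (the
      exact fs/proc/page.c hunk is assumed; the value 23 comes from the
      paragraph above):

          #define KPF_OFFLINE     23      /* formerly KPF_BALLOON */

          /* in stable_page_flags(), roughly: */
          if (PageOffline(page))
                  u |= 1 << KPF_OFFLINE;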
      
      [akpm@linux-foundation.org: fix comment text, per David]
      Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Pankaj gupta <pagupta@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca215086
    • D
      mm: balloon: update comment about isolation/migration/compaction · 4d3467e1
      Committed by David Hildenbrand
      Patch series "mm/kdump: allow to exclude pages that are logically
      offline"
      
      Right now, pages inflated as part of a balloon driver will be dumped by
      dump tools like makedumpfile.  While XEN is able to check in the crash
      kernel whether a certain pfn is actually backed by memory in the
      hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
      virtio-balloon, hv-balloon and VMWare balloon inflated memory will
      essentially result in zero pages getting allocated by the hypervisor and
      the dump getting filled with this data.
      
      The allocation and reading of zero pages can directly be avoided if a
      dumping tool could know which pages only contain stale information not
      to be dumped.
      
      Also for XEN, calling into the kernel and asking the hypervisor if a pfn
      is backed can be avoided if the dumping tool would skip such pages right
      from the beginning.
      
      Dumping tools have no idea whether a given page is part of a balloon
      driver and shall not be dumped.  In particular, PG_reserved cannot be used for
      that purpose as all memory allocated during early boot is also
      PG_reserved, see discussion at [1].  So some other way of indication is
      required and a new page flag is frowned upon.
      
      We have PG_balloon (MAPCOUNT value), which is essentially unused now.  I
      suggest renaming it to something more generic (PG_offline) to mark pages
      as logically offline.  This flag can then e.g.  also be used by
      virtio-mem in the future to mark subsections as offline.  Or by other
      code that wants to put pages logically offline (e.g.  later maybe
      poisoned pages that shall no longer be used).
      
      This series converts PG_balloon to PG_offline, allows dumping tools to
      query the value to detect such pages and marks pages in the hv-balloon
      and XEN balloon properly as PG_offline.  Note that virtio-balloon
      already sets pages to PG_balloon (and now PG_offline).
      
      Please note that this is also helpful for a problem we were seeing under
      Hyper-V: Dumping logically offline memory (pages kept fake offline while
      onlining a section via online_page_callback) would under some conditions
      result in a kernel panic when dumping them.
      
      As I don't have access to XEN, Hyper-V, or VMWare installations, this
      was only tested with the virtio-balloon, and pages
      were properly skipped when dumping.  I'll also attach the makedumpfile
      patch to this series.
      
      [1] https://lkml.org/lkml/2018/7/20/566
      
      This patch (of 8):
      
      Commit b1123ea6 ("mm: balloon: use general non-lru movable page
      feature") reworked balloon handling to make use of the general non-lru
      movable page feature.  The big comment block in balloon_compaction.h
      contains quite some outdated information.  Let's fix this.
      
      Link: http://lkml.kernel.org/r/20181119101616.8901-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d3467e1
    • A
      mm/page_alloc.c: memory hotplug: free pages as higher order · a9cd410a
      Committed by Arun KS
      When pages are freed at a higher order, the time the buddy allocator
      spends coalescing them can be reduced.  With a section size of 256MB,
      the hot-add latency of a single section improves from 50-60 ms to less
      than 1 ms, i.e. by roughly a factor of 60.  Modify external providers
      of the online callback to align with the change.
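
      A sketch of the idea (the helper and callback names are illustrative,
      and the order parameter on the callback is assumed from the
      description above):

          /* Sketch: hand a hot-added, MAX_ORDER-aligned range back to the
           * buddy allocator in the largest chunks possible instead of one
           * page at a time, avoiding most of the merge work. */
          static void free_hotadded_range(unsigned long pfn, unsigned long nr_pages)
          {
                  unsigned long end = pfn + nr_pages;
                  unsigned int order;

                  while (pfn < end) {
                          order = min_t(unsigned int, MAX_ORDER - 1, ilog2(end - pfn));
                          online_page_cb(pfn_to_page(pfn), order);  /* hypothetical callback */
                          pfn += 1UL << order;
                  }
          }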
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9cd410a
    • Q
      mm/slub.c: remove an unused addr argument · 278d7756
      Committed by Qian Cai
      "addr" function argument is not used in alloc_consistency_checks() at
      all, so remove it.
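
      The cleanup is essentially a signature change; a sketch of its shape
      (the exact mm/slub.c context is assumed):

          -static inline int alloc_consistency_checks(struct kmem_cache *s,
          -                               struct page *page, void *object, unsigned long addr)
          +static inline int alloc_consistency_checks(struct kmem_cache *s,
          +                               struct page *page, void *object)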
      
      Link: http://lkml.kernel.org/r/20190211123214.35592-1-cai@lca.pw
      Fixes: becfda68 ("slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      278d7756
    • T
      include/linux/slub_def.h: comment fixes · de810f49
      Committed by Tobin C. Harding
      Capitalize comment strings, use C89 comment style, and correct
      grammar/punctuation in comments.
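
      For illustration only (not one of the actual hunks), the kind of
      comment being adjusted:

          - // number of objects in a slab
          + /* Number of objects in a slab. */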
      
      Link: http://lkml.kernel.org/r/20190204005713.9463-2-tobin@kernel.org
      Link: http://lkml.kernel.org/r/20190204005713.9463-3-tobin@kernel.org
      Link: http://lkml.kernel.org/r/20190204005713.9463-4-tobin@kernel.org
      Signed-off-by: Tobin C. Harding <tobin@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de810f49
    • Q
      mm/slab.c: kmemleak no scan alien caches · 92d1d07d
      Committed by Qian Cai
      Kmemleak throws endless warnings during boot due to the following code
      in __alloc_alien_cache():
      
          alc = kmalloc_node(memsize, gfp, node);
          init_arraycache(&alc->ac, entries, batch);
          kmemleak_no_scan(ac);
      
      Kmemleak tracks the alien cache (alc) itself rather than the embedded
      array cache (alc->ac), so kmemleak_no_scan() must be applied to the
      former; do that by lifting kmemleak_no_scan() out of init_arraycache().
      
      There is one other caller of init_arraycache(), but
      alloc_kmem_cache_cpus() uses a percpu allocation, which will never be
      considered a leak.
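
      A sketch of the corrected flow described above (error handling of the
      real __alloc_alien_cache() is elided):

          alc = kmalloc_node(memsize, gfp, node);
          if (alc) {
                  init_arraycache(&alc->ac, entries, batch);
                  /* exempt the object kmemleak actually tracks from scanning */
                  kmemleak_no_scan(alc);
          }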
      
        kmemleak: Found object by alias at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         lookup_object+0x84/0xac
         find_and_get_object+0x84/0xe4
         kmemleak_no_scan+0x74/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
        kmemleak: Object 0xffff8007b9aa7e00 (size 256):
        kmemleak:   comm "swapper/0", pid 1, jiffies 4294697137
        kmemleak:   min_count = 1
        kmemleak:   count = 0
        kmemleak:   flags = 0x1
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
             kmemleak_alloc+0x84/0xb8
             kmem_cache_alloc_node_trace+0x31c/0x3a0
             __kmalloc_node+0x58/0x78
             setup_kmem_cache_node+0x26c/0x35c
             __do_tune_cpucache+0x250/0x2d4
             do_tune_cpucache+0x4c/0xe4
             enable_cpucache+0xc8/0x110
             setup_cpu_cache+0x40/0x1b8
             __kmem_cache_create+0x240/0x358
             create_cache+0xc0/0x198
             kmem_cache_create_usercopy+0x158/0x20c
             kmem_cache_create+0x50/0x64
             fsnotify_init+0x58/0x6c
             do_one_initcall+0x194/0x388
             kernel_init_freeable+0x668/0x688
             kernel_init+0x18/0x124
        kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         kmemleak_no_scan+0x90/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
      
      Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
      Fixes: 1fe00d50 ("slab: factor out initialization of array cache")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92d1d07d
    • P
      mm/slub.c: freelist is ensured to be NULL when new_slab() fails · edde82b6
      Committed by Peng Wang
      new_slab_objects() will return immediately if freelist is not NULL.
      
               if (freelist)
                       return freelist;
      
      One more assignment operation could be avoided.
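
      A sketch of the resulting shape of new_slab_objects() (heavily
      abridged; the elided steps are summarized in a comment):

          page = new_slab(s, flags, node);
          if (page) {
                  /* ... install the new page as the cpu slab ... */
                  freelist = page->freelist;
          }
          /* no "else freelist = NULL;" needed: freelist is already NULL here */

          return freelist;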
      
      Link: http://lkml.kernel.org/r/20181229062512.30469-1-rocking@whu.edu.cn
      Signed-off-by: Peng Wang <rocking@whu.edu.cn>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edde82b6