1. 10 3月, 2018 1 次提交
    • D
      mm/page_alloc: fix memmap_init_zone pageblock alignment · 864b75f9
      Daniel Vacek 提交于
      Commit b92df1de ("mm: page_alloc: skip over regions of invalid pfns
      where possible") introduced a bug where move_freepages() triggers a
      VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
      To fix this, simply align the skipped pfns in memmap_init_zone() the
      same way as in move_freepages_block().
      
      Seen in one of the RHEL reports:
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010:[<ffffffff8118833e>]  [<ffffffff8118833e>] move_freepages+0x15e/0x160
        RSP: 0018:ffff88054d727688  EFLAGS: 00010087
        --
        Call Trace:
         [<ffffffff811883b3>] move_freepages_block+0x73/0x80
         [<ffffffff81189e63>] __rmqueue+0x263/0x460
         [<ffffffff8118c781>] get_page_from_freelist+0x7e1/0x9e0
         [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
        --
        RIP  [<ffffffff8118833e>] move_freepages+0x15e/0x160
         RSP <ffff88054d727688>
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff	System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff	System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff	System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff	System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff	System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff	System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff	System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
      Note that this range follows two not populated sections
      68000000-77ffffff in this zone.  7b788000-7b7fffff is the first one
      after a gap.  This makes memmap_init_zone() skip all the pfns up to the
      beginning of this range.  But this range is not pageblock (2M) aligned.
      In fact no range has to be.
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0	<<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Top part of page flags should contain nodeid and zonenr, which is not
      the case for page ffffea0001ed8000 here (<<<<).
      
        crash> log | grep -o fffea0001ed[^\ ]* | sort -u
        fffea0001ed8000
        fffea0001eded20
        fffea0001edffc0
      
        crash> bt -r | grep -o fffea0001ed[^\ ]* | sort -u
        fffea0001ed8000
        fffea0001eded00
        fffea0001eded20
        fffea0001edffc0
      
      Initialization of the whole beginning of the section is skipped up to
      the start of the range due to the commit b92df1de.  Now any code
      calling move_freepages_block() (like reusing the page from a freelist as
      in this example) with a page from the beginning of the range will get
      the page rounded down to start_page ffffea0001ed8000 and passed to
      move_freepages() which crashes on assertion getting wrong zonenr.
      
        >         VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
      
      Note, page_zone() derives the zone from page flags here.
      
      From similar machine before commit b92df1de:
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        fffff73941e00000  78000000                0        0  1 1fffff00000000
        fffff73941ed7fc0  7b5ff000                0        0  1 1fffff00000000
        fffff73941ed8000  7b600000                0        0  1 1fffff00000000
        fffff73941edff80  7b7fe000                0        0  1 1fffff00000000
        fffff73941edffc0  7b7ff000 ffff8e67e04d3ae0     ad84  1 1fffff00020068 uptodate,lru,active,mappedtodisk
      
      All the pages since the beginning of the section are initialized.
      move_freepages()' not gonna blow up.
      
      The same machine with this fix applied:
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001e00000  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  1 1fffff00000000
        ffffea0001edff80  7b7fe000                0        0  1 1fffff00000000
        ffffea0001edffc0  7b7ff000 ffff88017fb13720        8  2 1fffff00020068 uptodate,lru,active,mappedtodisk
      
      At least the bare minimum of pages is initialized preventing the crash
      as well.
      
      Customers started to report this as soon as 7.4 (where b92df1de was
      merged in RHEL) was released.  I remember reports from
      September/October-ish times.  It's not easily reproduced and happens on
      a handful of machines only.  I guess that's why.  But that does not make
      it less serious, I think.
      
      Though there actually is a report here:
        https://bugzilla.kernel.org/show_bug.cgi?id=196443
      
      And there are reports for Fedora from July:
        https://bugzilla.redhat.com/show_bug.cgi?id=1473242
      and CentOS:
        https://bugs.centos.org/view.php?id=13964
      and we internally track several dozens reports for RHEL bug
        https://bugzilla.redhat.com/show_bug.cgi?id=1525121
      
      Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: NDaniel Vacek <neelx@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      864b75f9
  2. 22 2月, 2018 1 次提交
    • J
      mm: don't defer struct page initialization for Xen pv guests · 895f7b8e
      Juergen Gross 提交于
      Commit f7f99100 ("mm: stop zeroing memory during allocation in
      vmemmap") broke Xen pv domains in some configurations, as the "Pinned"
      information in struct page of early page tables could get lost.
      
      This will lead to the kernel trying to write directly into the page
      tables instead of asking the hypervisor to do so.  The result is a crash
      like the following:
      
        BUG: unable to handle kernel paging request at ffff8801ead19008
        IP: xen_set_pud+0x4e/0xd0
        PGD 1c0a067 P4D 1c0a067 PUD 23a0067 PMD 1e9de0067 PTE 80100001ead19065
        Oops: 0003 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-default+ #271
        Hardware name: Dell Inc. Latitude E6440/0159N7, BIOS A07 06/26/2014
        task: ffffffff81c10480 task.stack: ffffffff81c00000
        RIP: e030:xen_set_pud+0x4e/0xd0
        Call Trace:
         __pmd_alloc+0x128/0x140
         ioremap_page_range+0x3f4/0x410
         __ioremap_caller+0x1c3/0x2e0
         acpi_os_map_iomem+0x175/0x1b0
         acpi_tb_acquire_table+0x39/0x66
         acpi_tb_validate_table+0x44/0x7c
         acpi_tb_verify_temp_table+0x45/0x304
         acpi_reallocate_root_table+0x12d/0x141
         acpi_early_init+0x4d/0x10a
         start_kernel+0x3eb/0x4a1
         xen_start_kernel+0x528/0x532
        Code: 48 01 e8 48 0f 42 15 a2 fd be 00 48 01 d0 48 ba 00 00 00 00 00 ea ff ff 48 c1 e8 0c 48 c1 e0 06 48 01 d0 48 8b 00 f6 c4 02 75 5d <4c> 89 65 00 5b 5d 41 5c c3 65 8b 05 52 9f fe 7e 89 c0 48 0f a3
        RIP: xen_set_pud+0x4e/0xd0 RSP: ffffffff81c03cd8
        CR2: ffff8801ead19008
        ---[ end trace 38eca2e56f1b642e ]---
      
      Avoid this problem by not deferring struct page initialization when
      running as Xen pv guest.
      
      Pavel said:
      
      : This is unique for Xen, so this particular issue won't effect other
      : configurations.  I am going to investigate if there is a way to
      : re-enable deferred page initialization on xen guests.
      
      [akpm@linux-foundation.org: explicitly include xen.h]
      Link: http://lkml.kernel.org/r/20180216154101.22865-1-jgross@suse.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: <stable@vger.kernel.org>	[4.15.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      895f7b8e
  3. 01 2月, 2018 4 次提交
  4. 09 1月, 2018 1 次提交
  5. 05 1月, 2018 1 次提交
    • D
      mm: check pfn_valid first in zero_resv_unavail · e8c24773
      Dave Young 提交于
      With latest kernel I get below bug while testing kdump:
      
        BUG: unable to handle kernel paging request at ffffea00034b1040
        IP: zero_resv_unavail+0xbd/0x126
        PGD 37b98067 P4D 37b98067 PUD 37b97067 PMD 0
        Oops: 0002 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.15.0-rc1+ #316
        Hardware name: LENOVO 20ARS1BJ02/20ARS1BJ02, BIOS GJET92WW (2.42 ) 03/03/2017
        task: ffffffff81a0e4c0 task.stack: ffffffff81a00000
        RIP: 0010:zero_resv_unavail+0xbd/0x126
        RSP: 0000:ffffffff81a03d88 EFLAGS: 00010006
        RAX: 0000000000000000 RBX: ffffea00034b1040 RCX: 0000000000000010
        RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffffea00034b1040
        RBP: 00000000000d2c41 R08: 00000000000000c0 R09: 0000000000000a0d
        R10: 0000000000000002 R11: 0000000000007f01 R12: ffffffff81a03d90
        R13: ffffea0000000000 R14: 0000000000000063 R15: 0000000000000062
        FS:  0000000000000000(0000) GS:ffffffff81c73000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffea00034b1040 CR3: 0000000037609000 CR4: 00000000000606b0
        Call Trace:
         ? free_area_init_nodes+0x640/0x664
         ? zone_sizes_init+0x58/0x72
         ? setup_arch+0xb50/0xc6c
         ? start_kernel+0x64/0x43d
         ? secondary_startup_64+0xa5/0xb0
        Code: c1 e8 0c 48 39 d8 76 27 48 89 de 48 c1 e3 06 48 c7 c7 7a 87 79 81 e8 b0 c0 3e ff 4c 01 eb b9 10 00 00 00 31 c0 48 89 df 49 ff c6 <f3> ab eb bc 6a 00 49 c7 c0 f0 93 d1 81 31 d2 83 ce ff 41 54 49
        RIP: zero_resv_unavail+0xbd/0x126 RSP: ffffffff81a03d88
        CR2: ffffea00034b1040
        ---[ end trace f5ba9e8f73c7ee26 ]---
      
      This is introduced by commit a4a3ede2 ("mm: zero reserved and
      unavailable struct pages").
      
      The reason is some efi reserved boot ranges is not reported in E820 ram.
      In my case it is a bgrt buffer:
      
        efi: mem00: [Boot Data          |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x00000000d2c41000-0x00000000d2c85fff] (0MB)
      
      Use "add_efi_memmap" can workaround the problem with another fix:
      
        http://lkml.kernel.org/r/20171130052327.GA3500@dhcp-128-65.nay.redhat.com
      
      In zero_resv_unavail it would be better to check pfn_valid first before
      zero the page struct.  This fixes the problem and potential other
      similar problems.  Also as Pavel Tatashin suggested checks pfn_valid at
      the beginning of the section.
      
      The range is backed by real memory.  The memory range is efi "Boot
      Service Data", that means after ExitBootServices() these ranges can be
      used as system ram.  But some of them need to be reserved, for example
      the bgrt image address in an acpi table, if the image memory is freed
      then kexec reboot will fail because kexec inherit same acpi table to
      initialize the driver.
      
      Link: http://lkml.kernel.org/r/20171201095048.GA3084@dhcp-128-65.nay.redhat.com
      Fixes: a4a3ede2 ("mm: zero reserved and unavailable struct pages")
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8c24773
  6. 15 12月, 2017 1 次提交
  7. 30 11月, 2017 2 次提交
    • M
      mm/cma: fix alloc_contig_range ret code/potential leak · 63cd4489
      Mike Kravetz 提交于
      If the call __alloc_contig_migrate_range() in alloc_contig_range returns
      -EBUSY, processing continues so that test_pages_isolated() is called
      where there is a tracepoint to identify the busy pages.  However, it is
      possible for busy pages to become available between the calls to these
      two routines.  In this case, the range of pages may be allocated.
      Unfortunately, the original return code (ret == -EBUSY) is still set and
      returned to the caller.  Therefore, the caller believes the pages were
      not allocated and they are leaked.
      
      Update the comment to indicate that allocation is still possible even if
      __alloc_contig_migrate_range returns -EBUSY.  Also, clear return code in
      this case so that it is not accidentally used or returned to caller.
      
      Link: http://lkml.kernel.org/r/20171122185214.25285-1-mike.kravetz@oracle.com
      Fixes: 8ef5849f ("mm/cma: always check which page caused allocation failure")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63cd4489
    • M
      mm, memory_hotplug: do not back off draining pcp free pages from kworker context · 4b81cb2f
      Michal Hocko 提交于
      drain_all_pages backs off when called from a kworker context since
      commit 0ccce3b9 ("mm, page_alloc: drain per-cpu pages from workqueue
      context") because the original IPI based pcp draining has been replaced
      by a WQ based one and the check wanted to prevent from recursion and
      inter workers dependencies.  This has made some sense at the time
      because the system WQ has been used and one worker holding the lock
      could be blocked while waiting for new workers to emerge which can be a
      problem under OOM conditions.
      
      Since then commit ce612879 ("mm: move pcp and lru-pcp draining into
      single wq") has moved draining to a dedicated (mm_percpu_wq) WQ with a
      rescuer so we shouldn't depend on any other WQ activity to make a
      forward progress so calling drain_all_pages from a worker context is
      safe as long as this doesn't happen from mm_percpu_wq itself which is
      not the case because all workers are required to _not_ depend on any MM
      locks.
      
      Why is this a problem in the first place? ACPI driven memory hot-remove
      (acpi_device_hotplug) is executed from the worker context.  We end up
      calling __offline_pages to free all the pages and that requires both
      lru_add_drain_all_cpuslocked and drain_all_pages to do their job
      otherwise we can have dangling pages on pcp lists and fail the offline
      operation (__test_page_isolated_in_pageblock would see a page with 0 ref
      count but without PageBuddy set).
      
      Fix the issue by removing the worker check in drain_all_pages.
      lru_add_drain_all_cpuslocked doesn't have this restriction so it works
      as expected.
      
      Link: http://lkml.kernel.org/r/20170828093341.26341-1-mhocko@kernel.org
      Fixes: 0ccce3b9 ("mm, page_alloc: drain per-cpu pages from workqueue context")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.11+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b81cb2f
  8. 18 11月, 2017 1 次提交
  9. 16 11月, 2017 19 次提交
    • O
      mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP · 0cd842f9
      Oscar Salvador 提交于
      free_area_init_node() calls alloc_node_mem_map(), but this function does
      nothing unless we have CONFIG_FLAT_NODE_MEM_MAP.
      
      As a cleanup, we can move the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" within
      alloc_node_mem_map() out of the function, and define a
      alloc_node_mem_map() { } when CONFIG_FLAT_NODE_MEM_MAP is not present.
      
      This also moves the printk that lays within the "#ifdef
      CONFIG_FLAT_NODE_MEM_MAP" block from free_area_init_node() to
      alloc_node_mem_map(), getting rid of the "#ifdef
      CONFIG_FLAT_NODE_MEM_MAP" in free_area_init_node().
      
      [akpm@linux-foundation.org: clean up the printk while we're there]
      Link: http://lkml.kernel.org/r/20171114111935.GA11758@techadventures.netSigned-off-by: NOscar Salvador <osalvador@techadventures.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0cd842f9
    • M
      mm: simplify nodemask printing · 0205f755
      Michal Hocko 提交于
      alloc_warn() and dump_header() have to explicitly handle NULL nodemask
      which forces both paths to use pr_cont.  We can do better.  printk
      already handles NULL pointers properly so all we need is to teach
      nodemask_pr_args to handle NULL nodemask carefully.  This allows
      simplification of both alloc_warn() and dump_header() and gets rid of
      pr_cont altogether.
      
      This patch has been motivated by patch from Joe Perches
      
        http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com
      
      [akpm@linux-foundation.org: fix tile warning, per Arnd]
      Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJoe Perches <joe@perches.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0205f755
    • P
      mm/page_alloc.c: broken deferred calculation · d135e575
      Pavel Tatashin 提交于
      In reset_deferred_meminit() we determine number of pages that must not
      be deferred.  We initialize pages for at least 2G of memory, but also
      pages for reserved memory in this node.
      
      The reserved memory is determined in this function:
      memblock_reserved_memory_within(), which operates over physical
      addresses, and returns size in bytes.  However, reset_deferred_meminit()
      assumes that that this function operates with pfns, and returns page
      count.
      
      The result is that in the best case machine boots slower than expected
      due to initializing more pages than needed in single thread, and in the
      worst case panics because fewer than needed pages are initialized early.
      
      Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
      Fixes: 864b9a39 ("mm: consider memblock reservations for deferred memory initialization sizing")
      Signed-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d135e575
    • T
      mm: don't warn about allocations which stall for too long · 400e2249
      Tetsuo Handa 提交于
      Commit 63f53dea ("mm: warn about allocations which stall for too
      long") was a great step for reducing possibility of silent hang up
      problem caused by memory allocation stalls.  But this commit reverts it,
      for it is possible to trigger OOM lockup and/or soft lockups when many
      threads concurrently called warn_alloc() (in order to warn about memory
      allocation stalls) due to current implementation of printk(), and it is
      difficult to obtain useful information due to limitation of synchronous
      warning approach.
      
      Current printk() implementation flushes all pending logs using the
      context of a thread which called console_unlock().  printk() should be
      able to flush all pending logs eventually unless somebody continues
      appending to printk() buffer.
      
      Since warn_alloc() started appending to printk() buffer while waiting
      for oom_kill_process() to make forward progress when oom_kill_process()
      is processing pending logs, it became possible for warn_alloc() to force
      oom_kill_process() loop inside printk().  As a result, warn_alloc()
      significantly increased possibility of preventing oom_kill_process()
      from making forward progress.
      
      ---------- Pseudo code start ----------
      Before warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          }
          goto retry;
      
      After warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          } else if (waited_for_10seconds()) {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      Although waited_for_10seconds() becomes true once per 10 seconds,
      unbounded number of threads can call waited_for_10seconds() at the same
      time.  Also, since threads doing waited_for_10seconds() keep doing
      almost busy loop, the thread doing print_one_log() can use little CPU
      resource.  Therefore, this situation can be simplified like
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          } else {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      when printk() is called faster than print_one_log() can process a log.
      
      One of possible mitigation would be to introduce a new lock in order to
      make sure that no other series of printk() (either oom_kill_process() or
      warn_alloc()) can append to printk() buffer when one series of printk()
      (either oom_kill_process() or warn_alloc()) is already in progress.
      
      Such serialization will also help obtaining kernel messages in readable
      form.
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            mutex_lock(&oom_printk_lock);
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_printk_lock);
            mutex_unlock(&oom_lock)
          } else {
            if (mutex_trylock(&oom_printk_lock)) {
              atomic_inc(&printk_pending_logs);
              mutex_unlock(&oom_printk_lock);
            }
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      But this commit does not go that direction, for we don't want to
      introduce a new lock dependency, and we unlikely be able to obtain
      useful information even if we serialized oom_kill_process() and
      warn_alloc().
      
      Synchronous approach is prone to unexpected results (e.g.  too late [1],
      too frequent [2], overlooked [3]).  As far as I know, warn_alloc() never
      helped with providing information other than "something is going wrong".
      I want to consider asynchronous approach which can obtain information
      during stalls with possibly relevant threads (e.g.  the owner of
      oom_lock and kswapd-like threads) and serve as a trigger for actions
      (e.g.  turn on/off tracepoints, ask libvirt daemon to take a memory dump
      of stalling KVM guest for diagnostic purpose).
      
      This commit temporarily loses ability to report e.g.  OOM lockup due to
      unable to invoke the OOM killer due to !__GFP_FS allocation request.
      But asynchronous approach will be able to detect such situation and emit
      warning.  Thus, let's remove warn_alloc().
      
      [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
      [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
      [3] commit db73ee0d ("mm, vmscan: do not loop on too_many_isolated for ever"))
      
      Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: NCong Wang <xiyou.wangcong@gmail.com>
      Reported-by: Nyuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
      Reported-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      400e2249
    • V
      mm, page_alloc: fix potential false positive in __zone_watermark_ok · b050e376
      Vlastimil Babka 提交于
      Since commit 97a16fc8 ("mm, page_alloc: only enforce watermarks for
      order-0 allocations"), __zone_watermark_ok() check for high-order
      allocations will shortcut per-migratetype free list checks for
      ALLOC_HARDER allocations, and return true as long as there's free page
      of any migratetype.  The intention is that ALLOC_HARDER can allocate
      from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.
      
      However, as a side effect, the watermark check will then also return
      true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
      CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list.  Since the
      allocation cannot actually obtain isolated pages, and might not be able
      to obtain CMA pages, this can result in a false positive.
      
      The condition should be rare and perhaps the outcome is not a fatal one.
      Still, it's better if the watermark check is correct.  There also
      shouldn't be a performance tradeoff here.
      
      Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
      Fixes: 97a16fc8 ("mm, page_alloc: only enforce watermarks for order-0 allocations")
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b050e376
    • K
      mm, sysctl: make NUMA stats configurable · 4518085e
      Kemi Wang 提交于
      This is the second step which introduces a tunable interface that allow
      numa stats configurable for optimizing zone_statistics(), as suggested
      by Dave Hansen and Ying Huang.
      
      =========================================================================
      
      When page allocation performance becomes a bottleneck and you can
      tolerate some possible tool breakage and decreased numa counter
      precision, you can do:
      
      	echo 0 > /proc/sys/vm/numa_stat
      
      In this case, numa counter update is ignored.  We can see about
      *4.8%*(185->176) drop of cpu cycles per single page allocation and
      reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
      drop of cpu cycles per single page allocation and reclaim on Jesper's
      page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
      (88 threads, 126G memory).
      
      Benchmark link provided by Jesper D Brouer (increase loop times to
      10000000):
      
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      =========================================================================
      
      When page allocation performance is not a bottleneck and you want all
      tooling to work, you can do:
      
      	echo 1 > /proc/sys/vm/numa_stat
      
      This is system default setting.
      
      Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
      for comments to help improve the original patch.
      
      [keescook@chromium.org: make sure mutex is a global static]
        Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
      Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.comSigned-off-by: NKemi Wang <kemi.wang@intel.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reported-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Suggested-by: NDave Hansen <dave.hansen@intel.com>
      Suggested-by: NYing Huang <ying.huang@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4518085e
    • V
      mm, page_alloc: simplify list handling in rmqueue_bulk() · 0fac3ba5
      Vlastimil Babka 提交于
      rmqueue_bulk() fills an empty pcplist with pages from the free list.  It
      tries to preserve increasing order by pfn to the caller, because it
      leads to better performance with some I/O controllers, as explained in
      commit e084b2d9 ("page-allocator: preserve PFN ordering when
      __GFP_COLD is set").
      
      To preserve the order, it's sufficient to add pages to the tail of the
      list as they are retrieved.  The current code instead adds to the head
      of the list, but then updates the list head pointer to the last added
      page, in each step.  This does result in the same order, but is
      needlessly confusing and potentially wasteful, with no apparent benefit.
      This patch simplifies the code and adjusts comment accordingly.
      
      Link: http://lkml.kernel.org/r/f6505442-98a9-12e4-b2cd-0fa83874c159@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fac3ba5
    • M
      mm: remove __GFP_COLD · 453f85d4
      Mel Gorman 提交于
      As the page free path makes no distinction between cache hot and cold
      pages, there is no real useful ordering of pages in the free list that
      allocation requests can take advantage of.  Juding from the users of
      __GFP_COLD, it is likely that a number of them are the result of copying
      other sites instead of actually measuring the impact.  Remove the
      __GFP_COLD parameter which simplifies a number of paths in the page
      allocator.
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark was
      tested but it showed no sigificant difference which is not surprising
      given that the __GFP_COLD branches are a miniscule percentage of the
      fault path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      453f85d4
    • M
      mm: remove cold parameter from free_hot_cold_page* · 2d4894b5
      Mel Gorman 提交于
      Most callers users of free_hot_cold_page claim the pages being released
      are cache hot.  The exception is the page reclaim paths where it is
      likely that enough pages will be freed in the near future that the
      per-cpu lists are going to be recycled and the cache hotness information
      is lost.  As no one really cares about the hotness of pages being
      released to the allocator, just ditch the parameter.
      
      The APIs are renamed to indicate that it's no longer about hot/cold
      pages.  It should also be less confusing as there are subtle differences
      between them.  __free_pages drops a reference and frees a page when the
      refcount reaches zero.  free_hot_cold_page handled pages whose refcount
      was already zero which is non-obvious from the name.  free_unref_page
      should be more obvious.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      [mgorman@techsingularity.net: add pages to head, not tail]
        Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d4894b5
    • M
      mm, page_alloc: enable/disable IRQs once when freeing a list of pages · 9cca35d4
      Mel Gorman 提交于
      Patch series "Follow-up for speed up page cache truncation", v2.
      
      This series is a follow-on for Jan Kara's series "Speed up page cache
      truncation" series.  We both ended up looking at the same problem but
      saw different problems based on the same data.  This series builds upon
      his work.
      
      A variety of workloads were compared on four separate machines but each
      machine showed gains albeit at different levels.  Minimally, some of the
      differences are due to NUMA where truncating data from a remote node is
      slower than a local node.  The workloads checked were
      
      o sparse truncate microbenchmark, tiny
      o sparse truncate microbenchmark, large
      o reaim-io disk workfile
      o dbench4 (modified by mmtests to produce more stable results)
      o filebench varmail configuration for small memory size
      o bonnie, directory operations, working set size 2*RAM
      
      reaim-io, dbench and filebench all showed minor gains.  Truncation does
      not dominate those workloads but were tested to ensure no other
      regressions.  They will not be reported further.
      
      The sparse truncate microbench was written by Jan.  It creates a number
      of files and then times how long it takes to truncate each one.  The
      "tiny" configuraiton creates a number of files that easily fits in
      memory and times how long it takes to truncate files with page cache.
      The large configuration uses enough files to have data that is twice the
      size of memory and so timings there include truncating page cache and
      working set shadow entries in the radix tree.
      
      Patches 1-4 are the most relevant parts of this series.  Patches 5-8 are
      optional as they are deleting code that is essentially useless but has a
      negligible performance impact.
      
      The changelogs have more information on performance but just for bonnie
      delete options, the main comparison is
      
      bonnie
                                            4.14.0-rc5             4.14.0-rc5             4.14.0-rc5
                                                jan-v2                vanilla                 mel-v2
      Hmean     SeqCreate ops         76.20 (   0.00%)       75.80 (  -0.53%)       76.80 (   0.79%)
      Hmean     SeqCreate read        85.00 (   0.00%)       85.00 (   0.00%)       85.00 (   0.00%)
      Hmean     SeqCreate del      13752.31 (   0.00%)    12090.23 ( -12.09%)    15304.84 (  11.29%)
      Hmean     RandCreate ops        76.00 (   0.00%)       75.60 (  -0.53%)       77.00 (   1.32%)
      Hmean     RandCreate read       96.80 (   0.00%)       96.80 (   0.00%)       97.00 (   0.21%)
      Hmean     RandCreate del     13233.75 (   0.00%)    11525.35 ( -12.91%)    14446.61 (   9.16%)
      
      Jan's series is the baseline and the vanilla kernel is 12% slower where
      as this series on top gains another 11%.  This is from a different
      machine than the data in the changelogs but the detailed data was not
      collected as there was no substantial change in v2.
      
      This patch (of 8):
      
      Freeing a list of pages current enables/disables IRQs for each page
      freed.  This patch splits freeing a list of pages into two operations --
      preparing the pages for freeing and the actual freeing.  This is a
      tradeoff - we're taking two passes of the list to free in exchange for
      avoiding multiple enable/disable of IRQs.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                 janbatch-v1r1            oneirq-v1r1
      Min          Time      149.00 (   0.00%)      141.00 (   5.37%)
      1st-qrtle    Time      150.00 (   0.00%)      142.00 (   5.33%)
      2nd-qrtle    Time      151.00 (   0.00%)      142.00 (   5.96%)
      3rd-qrtle    Time      151.00 (   0.00%)      143.00 (   5.30%)
      Max-90%      Time      153.00 (   0.00%)      144.00 (   5.88%)
      Max-95%      Time      155.00 (   0.00%)      147.00 (   5.16%)
      Max-99%      Time      201.00 (   0.00%)      195.00 (   2.99%)
      Max          Time      236.00 (   0.00%)      230.00 (   2.54%)
      Amean        Time      152.65 (   0.00%)      144.37 (   5.43%)
      Stddev       Time        9.78 (   0.00%)       10.44 (  -6.72%)
      Coeff        Time        6.41 (   0.00%)        7.23 ( -12.84%)
      Best99%Amean Time      152.07 (   0.00%)      143.72 (   5.50%)
      Best95%Amean Time      150.75 (   0.00%)      142.37 (   5.56%)
      Best90%Amean Time      150.59 (   0.00%)      142.19 (   5.58%)
      Best75%Amean Time      150.36 (   0.00%)      141.92 (   5.61%)
      Best50%Amean Time      150.04 (   0.00%)      141.69 (   5.56%)
      Best25%Amean Time      149.85 (   0.00%)      141.38 (   5.65%)
      
      With a tiny number of files, each file truncated has resident page cache
      and it shows that time to truncate is roughtly 5-6% with some minor
      jitter.
      
                                            4.14.0-rc4             4.14.0-rc4
                                         janbatch-v1r1            oneirq-v1r1
      Hmean     SeqCreate ops         65.27 (   0.00%)       81.86 (  25.43%)
      Hmean     SeqCreate read        39.48 (   0.00%)       47.44 (  20.16%)
      Hmean     SeqCreate del      24963.95 (   0.00%)    26319.99 (   5.43%)
      Hmean     RandCreate ops        65.47 (   0.00%)       82.01 (  25.26%)
      Hmean     RandCreate read       42.04 (   0.00%)       51.75 (  23.09%)
      Hmean     RandCreate del     23377.66 (   0.00%)    23764.79 (   1.66%)
      
      As expected, there is a small gain for the delete operation.
      
      [mgorman@techsingularity.net: use page_private and set_page_private helpers]
        Link: http://lkml.kernel.org/r/20171018101547.mjycw7zreb66jzpa@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-2-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cca35d4
    • A
      mm/page_alloc: make sure __rmqueue() etc are always inline · 85ccc8fa
      Aaron Lu 提交于
      __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
      __rmqueue_cma_fallback() are all in page allocator's hot path and better
      be finished as soon as possible.  One way to make them faster is by making
      them inline.  But as Andrew Morton and Andi Kleen pointed out:
      
        https://lkml.org/lkml/2017/10/10/1252
        https://lkml.org/lkml/2017/10/10/1279
      
      To make sure they are inlined, we should use __always_inline for them.
      
      With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
      processes to stress buddy, the results for will-it-scale.processes with
      and without the patch are:
      
      On a 2-sockets Intel-Skylake machine:
      
         compiler          base        head
        gcc-4.4.7       6496131     6911823 +6.4%
        gcc-4.9.4       7225110     7731072 +7.0%
        gcc-5.4.1       7054224     7688146 +9.0%
        gcc-6.2.0       7059794     7651675 +8.4%
      
      On a 4-sockets Intel-Skylake machine:
      
         compiler          base        head
        gcc-4.4.7      13162890    13508193 +2.6%
        gcc-4.9.4      14997463    15484353 +3.2%
        gcc-5.4.1      14708711    15449805 +5.0%
        gcc-6.2.0      14574099    15349204 +5.3%
      
      The above 4 compilers are used because I've done the tests through
      Intel's Linux Kernel Performance(LKP) infrastructure and they are the
      available compilers there.
      
      The benefit being less on 4 sockets machine is due to the lock
      contention there(perf-profile/native_queued_spin_lock_slowpath=81%) is
      less severe than on the 2 sockets machine(85%).
      
      What the benchmark does is: it forks nr_cpu processes and then each
      process does the following:
          1 mmap() 128M anonymous space;
          2 writes to each page there to trigger actual page allocation;
          3 munmap() it.
      in a loop.
      
        https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
      
      Binary size wise, I have locally built them with different compilers:
      
      [aaron@aaronlu obj]$ size */*/mm/page_alloc.o
         text    data     bss     dec     hex filename
        37409    9904    8524   55837    da1d gcc-4.9.4/base/mm/page_alloc.o
        38273    9904    8524   56701    dd7d gcc-4.9.4/head/mm/page_alloc.o
        37465    9840    8428   55733    d9b5 gcc-5.5.0/base/mm/page_alloc.o
        38169    9840    8428   56437    dc75 gcc-5.5.0/head/mm/page_alloc.o
        37573    9840    8428   55841    da21 gcc-6.4.0/base/mm/page_alloc.o
        38261    9840    8428   56529    dcd1 gcc-6.4.0/head/mm/page_alloc.o
        36863    9840    8428   55131    d75b gcc-7.2.0/base/mm/page_alloc.o
        37711    9840    8428   55979    daab gcc-7.2.0/head/mm/page_alloc.o
      
      Text size increased about 800 bytes for mm/page_alloc.o.
      
      [aaron@aaronlu obj]$ size */*/vmlinux
         text    data     bss     dec       hex     filename
      10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/base/vmlinux
      10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/head/vmlinux
      10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/base/vmlinux
      10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/head/vmlinux
      10094546   5836696 17715200 33646442  201676a gcc-6.4.0/base/vmlinux
      10094546   5836696 17715200 33646442  201676a gcc-6.4.0/head/vmlinux
      10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/base/vmlinux
      10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/head/vmlinux
      
      Text size for vmlinux has no change though, probably due to function
      alignment.
      
      Link: http://lkml.kernel.org/r/20171013063111.GA26032@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85ccc8fa
    • P
      mm: stop zeroing memory during allocation in vmemmap · f7f99100
      Pavel Tatashin 提交于
      vmemmap_alloc_block() will no longer zero the block, so zero memory at
      its call sites for everything except struct pages.  Struct page memory
      is zero'd by struct page initialization.
      
      Replace allocators in sparse-vmemmap to use the non-zeroing version.
      So, we will get the performance improvement by zeroing the memory in
      parallel when struct pages are zeroed.
      
      Add struct page zeroing as a part of initialization of other fields in
      __init_single_page().
      
      This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
      v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
      
                               BASE            FIX
      sparse_init     11.244671836s   0.007199623s
      zone_sizes_init  4.879775891s   8.355182299s
                        --------------------------
      Total           16.124447727s   8.362381922s
      
      sparse_init is where memory for struct pages is zeroed, and the zeroing
      part is moved later in this patch into __init_single_page(), which is
      called from zone_sizes_init().
      
      [akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c]
      Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: NBob Picco <bob.picco@oracle.com>
      Tested-by: NBob Picco <bob.picco@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7f99100
    • P
      mm: zero reserved and unavailable struct pages · a4a3ede2
      Pavel Tatashin 提交于
      Some memory is reserved but unavailable: not present in memblock.memory
      (because not backed by physical pages), but present in memblock.reserved.
      Such memory has backing struct pages, but they are not initialized by
      going through __init_single_page().
      
      In some cases these struct pages are accessed even if they do not
      contain any data.  One example is page_to_pfn() might access page->flags
      if this is where section information is stored (CONFIG_SPARSEMEM,
      SECTION_IN_PAGE_FLAGS).
      
      One example of such memory: trim_low_memory_range() unconditionally
      reserves from pfn 0, but e820__memblock_setup() might provide the
      exiting memory from pfn 1 (i.e.  KVM).
      
      Since struct pages are zeroed in __init_single_page(), and not during
      allocation time, we must zero such struct pages explicitly.
      
      The patch involves adding a new memblock iterator:
      	for_each_resv_unavail_range(i, p_start, p_end)
      
      Which iterates through reserved && !memory lists, and we zero struct pages
      explicitly by calling mm_zero_struct_page().
      
      ===
      
      Here is more detailed example of problem that this patch is addressing:
      
      Run tested on qemu with the following arguments:
      
      	-enable-kvm -cpu kvm64 -m 512 -smp 2
      
      This patch reports that there are 98 unavailable pages.
      
      They are: pfn 0 and pfns in range [159, 255].
      
      Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
      not reserve [159, 255] ones.
      
      e820__memblock_setup() reports linux that the following physical ranges are
      available:
          [1 , 158]
      [256, 130783]
      
      Notice, that exactly unavailable pfns are missing!
      
      Now, lets check what we have in zone 0: [1, 131039]
      
      pfn 0, is not part of the zone, but pfns [1, 158], are.
      
      However, the bigger problem we have if we do not initialize these struct
      pages is with memory hotplug.  Because, that path operates at 2M
      boundaries (section_nr).  And checks if 2M range of pages is hot
      removable.  It starts with first pfn from zone, rounds it down to 2M
      boundary (sturct pages are allocated at 2M boundaries when vmemmap is
      created), and checks if that section is hot removable.  In this case
      start with pfn 1 and convert it down to pfn 0.  Later pfn is converted
      to struct page, and some fields are checked.  Now, if we do not zero
      struct pages, we get unpredictable results.
      
      In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all
      vmemmap memory to ones, the following panic is observed with kernel test
      without this patch applied:
      
        BUG: unable to handle kernel NULL pointer dereference at          (null)
        IP: is_pageblock_removable_nolock+0x35/0x90
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT
        ...
        task: ffff88001f4e2900 task.stack: ffffc90000314000
        RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
        Call Trace:
         ? is_mem_section_removable+0x5a/0xd0
         show_mem_removable+0x6b/0xa0
         dev_attr_show+0x1b/0x50
         sysfs_kf_seq_show+0xa1/0x100
         kernfs_seq_show+0x22/0x30
         seq_read+0x1ac/0x3a0
         kernfs_fop_read+0x36/0x190
         ? security_file_permission+0x90/0xb0
         __vfs_read+0x16/0x30
         vfs_read+0x81/0x130
         SyS_read+0x44/0xa0
         entry_SYSCALL_64_fastpath+0x1f/0xbd
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: NBob Picco <bob.picco@oracle.com>
      Tested-by: NBob Picco <bob.picco@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4a3ede2
    • P
      mm: define memblock_virt_alloc_try_nid_raw · ea1f5f37
      Pavel Tatashin 提交于
      * A new variant of memblock_virt_alloc_* allocations:
      memblock_virt_alloc_try_nid_raw()
          - Does not zero the allocated memory
          - Does not panic if request cannot be satisfied
      
      * optimize early system hash allocations
      
      Clients can call alloc_large_system_hash() with flag: HASH_ZERO to
      specify that memory that was allocated for system hash needs to be
      zeroed, otherwise the memory does not need to be zeroed, and client will
      initialize it.
      
      If memory does not need to be zero'd, call the new
      memblock_virt_alloc_raw() interface, and thus improve the boot
      performance.
      
      * debug for raw alloctor
      
      When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
      returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
      places excpect zeroed memory.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: NBob Picco <bob.picco@oracle.com>
      Tested-by: NBob Picco <bob.picco@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea1f5f37
    • P
      mm: deferred_init_memmap improvements · 2f47a91f
      Pavel Tatashin 提交于
      Patch series "complete deferred page initialization", v12.
      
      SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config
      option, which defers initializing struct pages until all cpus have been
      started so it can be done in parallel.
      
      However, this feature is sub-optimal, because the deferred page
      initialization code expects that the struct pages have already been
      zeroed, and the zeroing is done early in boot with a single thread only.
      Also, we access that memory and set flags before struct pages are
      initialized.  All of this is fixed in this patchset.
      
      In this work we do the following:
       - Never read access struct page until it was initialized
       - Never set any fields in struct pages before they are initialized
       - Zero struct page at the beginning of struct page initialization
      
      ==========================================================================
      Performance improvements on x86 machine with 8 nodes:
      Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                              TIME          SPEED UP
      base no deferred:       95.796233s
      fix no deferred:        79.978956s    19.77%
      
      base deferred:          77.254713s
      fix deferred:           55.050509s    40.34%
      ==========================================================================
      SPARC M6 3600 MHz with 15T of memory
                              TIME          SPEED UP
      base no deferred:       358.335727s
      fix no deferred:        302.320936s   18.52%
      
      base deferred:          237.534603s
      fix deferred:           182.103003s   30.44%
      ==========================================================================
      Raw dmesg output with timestamps:
      x86 base no deferred:    https://hastebin.com/ofunepurit.scala
      x86 base deferred:       https://hastebin.com/ifazegeyas.scala
      x86 fix no deferred:     https://hastebin.com/pegocohevo.scala
      x86 fix deferred:        https://hastebin.com/ofupevikuk.scala
      sparc base no deferred:  https://hastebin.com/ibobeteken.go
      sparc base deferred:     https://hastebin.com/fariqimiyu.go
      sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
      sparc fix deferred:      https://hastebin.com/xadinobutu.go
      
      This patch (of 11):
      
      deferred_init_memmap() is called when struct pages are initialized later
      in boot by slave CPUs.  This patch simplifies and optimizes this
      function, and also fixes a couple issues (described below).
      
      The main change is that now we are iterating through free memblock areas
      instead of all configured memory.  Thus, we do not have to check if the
      struct page has already been initialized.
      
      =====
      In deferred_init_memmap() where all deferred struct pages are
      initialized we have a check like this:
      
        if (page->flags) {
      	VM_BUG_ON(page_zone(page) != zone);
      	goto free_range;
        }
      
      This way we are checking if the current deferred page has already been
      initialized.  It works, because memory for struct pages has been zeroed,
      and the only way flags are not zero if it went through
      __init_single_page() before.  But, once we change the current behavior
      and won't zero the memory in memblock allocator, we cannot trust
      anything inside "struct page"es until they are initialized.  This patch
      fixes this.
      
      The deferred_init_memmap() is re-written to loop through only free
      memory ranges provided by memblock.
      
      Note, this first issue is relevant only when the following change is
      merged:
      
      =====
      This patch fixes another existing issue on systems that have holes in
      zones i.e CONFIG_HOLES_IN_ZONE is defined.
      
      In for_each_mem_pfn_range() we have code like this:
      
        if (!pfn_valid_within(pfn)
      	goto free_range;
      
      Note: 'page' is not set to NULL and is not incremented but 'pfn'
      advances.  Thus means if deferred struct pages are enabled on systems
      with these kind of holes, linux would get memory corruptions.  I have
      fixed this issue by defining a new macro that performs all the necessary
      operations when we free the current set of pages.
      
      [pasha.tatashin@oracle.com: buddy page accessed before initialized]
        Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: NBob Picco <bob.picco@oracle.com>
      Tested-by: NBob Picco <bob.picco@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f47a91f
    • L
      kmemcheck: remove annotations · 49502766
      Levin, Alexander (Sasha Levin) 提交于
      Patch series "kmemcheck: kill kmemcheck", v2.
      
      As discussed at LSF/MM, kill kmemcheck.
      
      KASan is a replacement that is able to work without the limitation of
      kmemcheck (single CPU, slow).  KASan is already upstream.
      
      We are also not aware of any users of kmemcheck (or users who don't
      consider KASan as a suitable replacement).
      
      The only objection was that since KASAN wasn't supported by all GCC
      versions provided by distros at that time we should hold off for 2
      years, and try again.
      
      Now that 2 years have passed, and all distros provide gcc that supports
      KASAN, kill kmemcheck again for the very same reasons.
      
      This patch (of 4):
      
      Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
      
      [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
        Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
      Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.comSigned-off-by: NSasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49502766
    • M
      mm, page_alloc: fail has_unmovable_pages when seeing reserved pages · d7ab3672
      Michal Hocko 提交于
      Reserved pages should be completely ignored by the core mm because they
      have a special meaning for their owners.  has_unmovable_pages doesn't
      check those so we rely on other tests (reference count, or PageLRU) to
      fail on such pages.  Althought this happens to work it is safer to
      simply check for those explicitly and do not rely on the owner of the
      page to abuse those fields for special purposes.
      
      Please note that this is more of a further fortification of the code
      rahter than a fix of an existing issue.
      
      Link: http://lkml.kernel.org/r/20171013120756.jeopthigbmm3c7bl@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7ab3672
    • M
      mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages() · 4da2ce25
      Michal Hocko 提交于
      Joonsoo has noticed that "mm: drop migrate type checks from
      has_unmovable_pages" would break CMA allocator because it relies on
      has_unmovable_pages returning false even for CMA pageblocks which in
      fact don't have to be movable:
      
       alloc_contig_range
         start_isolate_page_range
           set_migratetype_isolate
             has_unmovable_pages
      
      This is a result of the code sharing between CMA and memory hotplug
      while each one has a different idea of what has_unmovable_pages should
      return.  This is unfortunate but fixing it properly would require a lot
      of code duplication.
      
      Fix the issue by introducing the requested migrate type argument and
      special case MIGRATE_CMA case where CMA page blocks are handled
      properly.  This will work for memory hotplug because it requires
      MIGRATE_MOVABLE.
      
      Link: http://lkml.kernel.org/r/20171019122118.y6cndierwl2vnguj@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: NStefan Wahren <stefan.wahren@i2se.com>
      Tested-by: NRan Wang <ran.wang_1@nxp.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4da2ce25
    • M
      mm: drop migrate type checks from has_unmovable_pages · d7b236e1
      Michal Hocko 提交于
      Michael has noticed that the memory offline tries to migrate kernel code
      pages when doing
      
       echo 0 > /sys/devices/system/memory/memory0/online
      
      The current implementation will fail the operation after several failed
      page migration attempts but we shouldn't even attempt to migrate that
      memory and fail right away because this memory is clearly not
      migrateable.  This will become a real problem when we drop the retry
      loop counter resp.  timeout.
      
      The real problem is in has_unmovable_pages in fact.  We should fail if
      there are any non migrateable pages in the area.  In orther to guarantee
      that remove the migrate type checks because MIGRATE_MOVABLE is not
      guaranteed to contain only migrateable pages.  It is merely a heuristic.
      Similarly MIGRATE_CMA does guarantee that the page allocator doesn't
      allocate any non-migrateable pages from the block but CMA allocations
      themselves are unlikely to migrateable.  Therefore remove both checks.
      
      [akpm@linux-foundation.org: remove unused local `mt']
      Link: http://lkml.kernel.org/r/20171013120013.698-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NMichael Ellerman <mpe@ellerman.id.au>
      Tested-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NTony Lindgren <tony@atomide.com>
      Tested-by: NRan Wang <ran.wang_1@nxp.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7b236e1
  10. 07 11月, 2017 1 次提交
  11. 20 10月, 2017 1 次提交
  12. 04 10月, 2017 2 次提交
  13. 09 9月, 2017 2 次提交
    • T
      mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt · f19360f0
      Tetsuo Handa 提交于
      We are by error initializing alloc_flags before gfp_allowed_mask is
      applied.  This could cause problems after pm_restrict_gfp_mask() is called
      during suspend operation.  Apply gfp_allowed_mask before initializing
      alloc_flags so that the first allocation attempt uses correct flags.
      
      Link: http://lkml.kernel.org/r/201709020016.ADJ21342.OFLJHOOSMFVtFQ@I-love.SAKURA.ne.jp
      Fixes: 83d4ca81 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f19360f0
    • K
      mm: change the call sites of numa statistics items · 3a321d2a
      Kemi Wang 提交于
      Patch series "Separate NUMA statistics from zone statistics", v2.
      
      Each page allocation updates a set of per-zone statistics with a call to
      zone_statistics().  As discussed in 2017 MM summit, these are a
      substantial source of overhead in the page allocator and are very rarely
      consumed.  This significant overhead in cache bouncing caused by zone
      counters (NUMA associated counters) update in parallel in multi-threaded
      page allocation (pointed out by Dave Hansen).
      
      A link to the MM summit slides:
        http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
      
      To mitigate this overhead, this patchset separates NUMA statistics from
      zone statistics framework, and update NUMA counter threshold to a fixed
      size of MAX_U16 - 2, as a small threshold greatly increases the update
      frequency of the global counter from local per cpu counter (suggested by
      Ying Huang).  The rationality is that these statistics counters don't
      need to be read often, unlike other VM counters, so it's not a problem
      to use a large threshold and make readers more expensive.
      
      With this patchset, we see 31.3% drop of CPU cycles(537-->369, see
      below) for per single page allocation and reclaim on Jesper's
      page_bench03 benchmark.  Meanwhile, this patchset keeps the same style
      of virtual memory statistics with little end-user-visible effects (only
      move the numa stats to show behind zone page stats, see the first patch
      for details).
      
      I did an experiment of single page allocation and reclaim concurrently
      using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based
      server (88 processors with 126G memory) with different size of threshold
      of pcp counter.
      
      Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
         Threshold   CPU cycles    Throughput(88 threads)
            32        799         241760478
            64        640         301628829
            125       537         358906028 <==> system by default
            256       468         412397590
            512       428         450550704
            4096      399         482520943
            20000     394         489009617
            30000     395         488017817
            65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
            N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
      
      This patch (of 3):
      
      In this patch, NUMA statistics is separated from zone statistics
      framework, all the call sites of NUMA stats are changed to use
      numa-stats-specific functions, it does not have any functionality change
      except that the number of NUMA stats is shown behind zone page stats
      when users *read* the zone info.
      
      E.g. cat /proc/zoneinfo
          ***Base***                           ***With this patch***
      nr_free_pages 3976                         nr_free_pages 3976
      nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
      nr_zone_active_anon 0                      nr_zone_active_anon 0
      nr_zone_inactive_file 0                    nr_zone_inactive_file 0
      nr_zone_active_file 0                      nr_zone_active_file 0
      nr_zone_unevictable 0                      nr_zone_unevictable 0
      nr_zone_write_pending 0                    nr_zone_write_pending 0
      nr_mlock     0                             nr_mlock     0
      nr_page_table_pages 0                      nr_page_table_pages 0
      nr_kernel_stack 0                          nr_kernel_stack 0
      nr_bounce    0                             nr_bounce    0
      nr_zspages   0                             nr_zspages   0
      numa_hit 0                                *nr_free_cma  0*
      numa_miss 0                                numa_hit     0
      numa_foreign 0                             numa_miss    0
      numa_interleave 0                          numa_foreign 0
      numa_local   0                             numa_interleave 0
      numa_other   0                             numa_local   0
      *nr_free_cma 0*                            numa_other 0
          ...                                        ...
      vm stats threshold: 10                     vm stats threshold: 10
          ...                                        ...
      
      The next patch updates the numa stats counter size and threshold.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.comSigned-off-by: NKemi Wang <kemi.wang@intel.com>
      Reported-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a321d2a
  14. 07 9月, 2017 3 次提交
    • M
      mm, oom: do not rely on TIF_MEMDIE for memory reserves access · cd04ae1e
      Michal Hocko 提交于
      For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
      victims and then, among other things, to give these threads full access
      to memory reserves.  There are few shortcomings of this implementation,
      though.
      
      First of all and the most serious one is that the full access to memory
      reserves is quite dangerous because we leave no safety room for the
      system to operate and potentially do last emergency steps to move on.
      
      Secondly this flag is per task_struct while the OOM killer operates on
      mm_struct granularity so all processes sharing the given mm are killed.
      Giving the full access to all these task_structs could lead to a quick
      memory reserves depletion.  We have tried to reduce this risk by giving
      TIF_MEMDIE only to the main thread and the currently allocating task but
      that doesn't really solve this problem while it surely opens up a room
      for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
      allocator without access to memory reserves because a particular thread
      was not the group leader.
      
      Now that we have the oom reaper and that all oom victims are reapable
      after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the
      kthreads") we can be more conservative and grant only partial access to
      memory reserves because there are reasonable chances of the parallel
      memory freeing.  We still want some access to reserves because we do not
      want other consumers to eat up the victim's freed memory.  oom victims
      will still contend with __GFP_HIGH users but those shouldn't be so
      aggressive to starve oom victims completely.
      
      Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
      the half of the reserves.  This makes the access to reserves independent
      on which task has passed through mark_oom_victim.  Also drop any usage
      of TIF_MEMDIE from the page allocator proper and replace it by
      tsk_is_oom_victim as well which will make page_alloc.c completely
      TIF_MEMDIE free finally.
      
      CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
      ALLOC_NO_WATERMARKS approach.
      
      There is a demand to make the oom killer memcg aware which will imply
      many tasks killed at once.  This change will allow such a usecase
      without worrying about complete memory reserves depletion.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd04ae1e
    • M
      mm: rename global_page_state to global_zone_page_state · c41f012a
      Michal Hocko 提交于
      global_page_state is error prone as a recent bug report pointed out [1].
      It only returns proper values for zone based counters as the enum it
      gets suggests.  We already have global_node_page_state so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seems to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c41f012a
    • M
      mm, memory_hotplug: get rid of zonelists_mutex · b93e0f32
      Michal Hocko 提交于
      zonelists_mutex was introduced by commit 4eaf3f64 ("mem-hotplug: fix
      potential race while building zonelist for new populated zone") to
      protect zonelist building from races.  This is no longer needed though
      because both memory online and offline are fully serialized.  New users
      have grown since then.
      
      Notably setup_per_zone_wmarks wants to prevent from races between memory
      hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
      (see cfd3da1e ("mm: Serialize access to min_free_kbytes").  Let's
      add a private lock for that purpose.  This will not prevent from seeing
      halfway through memory hotplug operation but that shouldn't be a big
      deal becuse memory hotplug will update watermarks explicitly so we will
      eventually get a full picture.  The lock just makes sure we won't race
      when updating watermarks leading to weird results.
      
      Also __build_all_zonelists manipulates global data so add a private lock
      for it as well.  This doesn't seem to be necessary today but it is more
      robust to have a lock there.
      
      While we are at it make sure we document that memory online/offline
      depends on a full serialization either via mem_hotplug_begin() or
      device_lock.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Haicheng Li <haicheng.li@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b93e0f32