1. April 20, 2019 (3 commits)
    • mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES · dd862deb
      Committed by Hugh Dickins
      SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but
      further testing has proved it to be a source of unnecessary swapoff
      EBUSY failures (which can then be followed by unmount EBUSY failures).
      
      When mmget_not_zero() or shmem's igrab() fails, there is an mm exiting
      or inode being evicted, freeing up swap independent of try_to_unuse().
      Those typically completed much sooner than the old quadratic swapoff,
      but now it's more common that swapoff may need to wait for them.
      
      It's possible to move those cases from init_mm.mmlist and shmem_swaplist
      to separate "exiting" swaplists, and try_to_unuse() then wait for those
      lists to be emptied; but we've not bothered with that in the past, and
      don't want to risk missing some other forgotten case.  So just revert to
      cycling around until the swap is gone, without any retries limit.
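
      A minimal sketch of the resulting behavior (simplified; unuse_all() is
      a hypothetical stand-in for the real per-type scan):

        /* Keep cycling until no swap entries of this type remain in use,
         * rather than giving up after SWAP_UNUSE_MAX_TRIES attempts. */
        while (READ_ONCE(si->inuse_pages)) {
                int err = unuse_all(si->type);  /* hypothetical helper */

                if (err)
                        return err;     /* a real error: propagate it */
                cond_resched();         /* mms may still be exiting and
                                           freeing swap on their own */
        }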
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081256170.1523@eggly.anvils
      Fixes: b56a2d8a ("mm: rid swapoff of quadratic complexity")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kelley Nielsen <kelleynnn@gmail.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vineeth Pillai <vpillai@digitalocean.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: swapoff: shmem_find_swap_entries() filter out other types · 87039546
      Committed by Hugh Dickins
      The swapfile "type" was passed all the way down to shmem_unuse_inode(),
      but was then forgotten in shmem_find_swap_entries(): with the result
      that removing one swapfile would try to free up all the swap from
      shmem.  That is no problem when there is only one swapfile anyway, but
      counter-productive when there are more, causing swapoff to be
      unnecessarily OOM-killed when it should succeed.
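
      The heart of the fix is a type check in the swap-entry scan, roughly
      like this sketch (loop context simplified):

        /* Only unuse entries belonging to the swapfile being removed. */
        entry = radix_to_swp_entry(page);
        if (swp_type(entry) != type)
                continue;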
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081254470.1523@eggly.anvils
      Fixes: b56a2d8a ("mm: rid swapoff of quadratic complexity")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
      Cc: Vineeth Pillai <vpillai@digitalocean.com>
      Cc: Kelley Nielsen <kelleynnn@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: store tagged freelist for off-slab slabmgmt · 1a62b18d
      Committed by Qian Cai
      Commit 51dedad0 ("kasan, slab: make freelist stored without tags")
      calls kasan_reset_tag() for the off-slab slab management object,
      leading to the freelist being stored untagged.
      
      However, cache_grow_begin() calls alloc_slabmgmt(), which calls
      kmem_cache_alloc_node(); that assigns a tag for the address and stores
      it in the shadow memory.  As a result, it causes the endless errors
      below during boot, because drain_freelist() -> slab_destroy() ->
      kasan_slab_free() compares the already-untagged freelist against the
      tag stored in the shadow memory.
      
      Since the off-slab slab management object's freelist is such a special
      case, just store it tagged.  The non-off-slab management object's
      freelist is still stored untagged; it has never been assigned a tag
      and should not cause any other trouble from this inconsistency.
      
        BUG: KASAN: double-free or invalid-free in slab_destroy+0x84/0x88
        Pointer tag: [ff], memory tag: [99]
      
        CPU: 0 PID: 1376 Comm: kworker/0:4 Tainted: G        W 5.1.0-rc3+ #8
        Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.6 07/10/2018
        Workqueue: cgroup_destroy css_killed_work_fn
        Call trace:
         print_address_description+0x74/0x2a4
         kasan_report_invalid_free+0x80/0xc0
         __kasan_slab_free+0x204/0x208
         kasan_slab_free+0xc/0x18
         kmem_cache_free+0xe4/0x254
         slab_destroy+0x84/0x88
         drain_freelist+0xd0/0x104
         __kmem_cache_shrink+0x1ac/0x224
         __kmemcg_cache_deactivate+0x1c/0x28
         memcg_deactivate_kmem_caches+0xa0/0xe8
         memcg_offline_kmem+0x8c/0x3d4
         mem_cgroup_css_offline+0x24c/0x290
         css_killed_work_fn+0x154/0x618
         process_one_work+0x9cc/0x183c
         worker_thread+0x9b0/0xe38
         kthread+0x374/0x390
         ret_from_fork+0x10/0x18
      
        Allocated by task 1625:
         __kasan_kmalloc+0x168/0x240
         kasan_slab_alloc+0x18/0x20
         kmem_cache_alloc_node+0x1f8/0x3a0
         cache_grow_begin+0x4fc/0xa24
         cache_alloc_refill+0x2f8/0x3e8
         kmem_cache_alloc+0x1bc/0x3bc
         sock_alloc_inode+0x58/0x334
         alloc_inode+0xb8/0x164
         new_inode_pseudo+0x20/0xec
         sock_alloc+0x74/0x284
         __sock_create+0xb0/0x58c
         sock_create+0x98/0xb8
         __sys_socket+0x60/0x138
         __arm64_sys_socket+0xa4/0x110
         el0_svc_handler+0x2c0/0x47c
         el0_svc+0x8/0xc
      
        Freed by task 1625:
         __kasan_slab_free+0x114/0x208
         kasan_slab_free+0xc/0x18
         kfree+0x1a8/0x1e0
         single_release+0x7c/0x9c
         close_pdeo+0x13c/0x43c
         proc_reg_release+0xec/0x108
         __fput+0x2f8/0x784
         ____fput+0x1c/0x28
         task_work_run+0xc0/0x1b0
         do_notify_resume+0xb44/0x1278
         work_pending+0x8/0x10
      
        The buggy address belongs to the object at ffff809681b89e00
         which belongs to the cache kmalloc-128 of size 128
        The buggy address is located 0 bytes inside of
         128-byte region [ffff809681b89e00, ffff809681b89e80)
        The buggy address belongs to the page:
        page:ffff7fe025a06e00 count:1 mapcount:0 mapping:01ff80082000fb00
        index:0xffff809681b8fe04
        flags: 0x17ffffffc000200(slab)
        raw: 017ffffffc000200 ffff7fe025a06d08 ffff7fe022ef7b88 01ff80082000fb00
        raw: ffff809681b8fe04 ffff809681b80000 00000001000000e0 0000000000000000
        page dumped because: kasan: bad access detected
        page allocated via order 0, migratetype Unmovable, gfp_mask
        0x2420c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE)
         prep_new_page+0x4e0/0x5e0
         get_page_from_freelist+0x4ce8/0x50d4
         __alloc_pages_nodemask+0x738/0x38b8
         cache_grow_begin+0xd8/0xa24
         ____cache_alloc_node+0x14c/0x268
         __kmalloc+0x1c8/0x3fc
         ftrace_free_mem+0x408/0x1284
         ftrace_free_init_mem+0x20/0x28
         kernel_init+0x24/0x548
         ret_from_fork+0x10/0x18
      
        Memory state around the buggy address:
         ffff809681b89c00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
         ffff809681b89d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
        >ffff809681b89e00: 99 99 99 99 99 99 99 99 fe fe fe fe fe fe fe fe
                           ^
         ffff809681b89f00: 43 43 43 43 43 fe fe fe fe fe fe fe fe fe fe fe
         ffff809681b8a000: 6d fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
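
      A sketch of the fix in alloc_slabmgmt(), per the description above
      (simplified; surrounding details are illustrative):

        if (OFF_SLAB(cachep)) {
                /* The freelist comes from kmem_cache_alloc_node(), which
                 * already assigned it a tag: keep the tag instead of
                 * calling kasan_reset_tag() on it as before. */
                freelist = kmem_cache_alloc_node(cachep->freelist_cache,
                                                 local_flags, nodeid);
                if (!freelist)
                        return NULL;
        }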
      
      Link: http://lkml.kernel.org/r/20190403022858.97584-1-cai@lca.pw
      Fixes: 51dedad0 ("kasan, slab: make freelist stored without tags")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. April 15, 2019 (1 commit)
  3. April 8, 2019 (1 commit)
  4. April 6, 2019 (4 commits)
    • mm/util.c: fix strndup_user() comment · e9145521
      Committed by Andrew Morton
      The kerneldoc misdescribes strndup_user()'s return value.
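
      For reference, a usage sketch of the behavior the comment should
      describe: strndup_user() returns an ERR_PTR() on failure, never NULL,
      so callers check it with IS_ERR():

        char *name = strndup_user(ubuf, PAGE_SIZE);

        if (IS_ERR(name))
                return PTR_ERR(name);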
      
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Timur Tabi <timur@freescale.com>
      Cc: Mihai Caraman <mihai.caraman@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: writeback: use exact memcg dirty counts · 0b3d6e6f
      Committed by Greg Thelen
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      leaves vmscan with nothing but dirty+writeback pages and thus resorts
      to oom kill.
      
      It could be argued that tiny containers are not supported, but it's
      more subtle.  It's the amount of space available for the file lru that
      matters.  If a container has memory.max minus 200 MiB of
      non-reclaimable memory, then it will also suffer such oom kills on a
      100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty 99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
        	output_fd = open(output, O_CREAT|O_RDWR, 0600);
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
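
      A sketch of the exact read (simplified; field names as in struct
      mem_cgroup of that era):

        /* Sum the atomic counter with every cpu's residual delta. */
        static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg,
                                                    int idx)
        {
                long x = atomic_long_read(&memcg->stat[idx]);
                int cpu;

                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
                return x > 0 ? x : 0;
        }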
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd() · c6f3c5ee
      Committed by Aneesh Kumar K.V
      With some architectures like ppc64, set_pmd_at() cannot cope with a
      situation where there is already some (different) valid entry present.
      
      Use pmdp_set_access_flags() instead to modify the entry; it is built
      to deal with modifying existing PMD entries.
      
      This is similar to commit cae85cb8 ("mm/memory.c: fix modifying of
      page protection by insert_pfn()").
      
      We also make a similar update for insert_pfn_pud(), even though ppc64
      does not support PUD pfn entries yet.
      
      Without this patch we also see the message below in the kernel log:
      "BUG: non-zero pgtables_bytes on freeing mm:".
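
      A simplified sketch of the approach in insert_pfn_pmd() (error
      handling omitted):

        if (!pmd_none(*pmd)) {
                /* An entry already exists: update it in place with
                 * pmdp_set_access_flags() instead of set_pmd_at(). */
                entry = pmd_mkyoung(*pmd);
                entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
                if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
                        update_mmu_cache_pmd(vma, addr, pmd);
                goto out_unlock;
        }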
      
      Link: http://lkml.kernel.org/r/20190402115125.18803-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: Chandan Rajendra <chandan@linux.ibm.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemleak: powerpc: skip scanning holes in the .bss section · 298a32b1
      Committed by Catalin Marinas
      Commit 2d4f5671 ("KVM: PPC: Introduce kvm_tmp framework") adds
      kvm_tmp[] into the .bss section and then frees the rest of the unused
      space back to the page allocator.
      
      kernel_init
        kvm_guest_init
          kvm_free_tmp
            free_reserved_area
              free_unref_page
                free_unref_page_prepare
      
      With DEBUG_PAGEALLOC=y, it will unmap those pages from the kernel.  As
      a result, a kmemleak scan will trigger a panic when it scans the .bss
      section and hits the unmapped pages.
      
      This patch creates dedicated kmemleak objects for the .data, .bss and
      potentially .data..ro_after_init sections to allow partial freeing via
      the kmemleak_free_part() in the powerpc kvm_free_tmp() function.
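
      The powerpc side then carves the freed range out of the .bss kmemleak
      object, along these lines:

        /* Tell kmemleak about the hole in .bss before the pages are
         * handed back to the page allocator. */
        kmemleak_free_part(&kvm_tmp[kvm_tmp_index],
                           ARRAY_SIZE(kvm_tmp) - kvm_tmp_index);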
      
      Link: http://lkml.kernel.org/r/20190321171917.62049-1-catalin.marinas@arm.com
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Reported-by: Qian Cai <cai@lca.pw>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Qian Cai <cai@lca.pw>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. April 4, 2019 (2 commits)
    • mm/compaction.c: abort search if isolation fails · 5b56d996
      Committed by Qian Cai
      Running LTP oom01 in a tight loop, or memory stress testing that puts
      the system in a low-memory situation, can trigger random memory
      corruption like the page flag corruption below: in
      fast_isolate_freepages(), if isolation fails, next_search_order() does
      not abort the search immediately, which can lead to improper accesses.
      
      UBSAN: Undefined behaviour in ./include/linux/mm.h:1195:50
      index 7 is out of range for type 'zone [5]'
      Call Trace:
       dump_stack+0x62/0x9a
       ubsan_epilogue+0xd/0x7f
       __ubsan_handle_out_of_bounds+0x14d/0x192
       __isolate_free_page+0x52c/0x600
       compaction_alloc+0x886/0x25f0
       unmap_and_move+0x37/0x1e70
       migrate_pages+0x2ca/0xb20
       compact_zone+0x19cb/0x3620
       kcompactd_do_work+0x2df/0x680
       kcompactd+0x1d8/0x6c0
       kthread+0x32c/0x3f0
       ret_from_fork+0x35/0x40
      ------------[ cut here ]------------
      kernel BUG at mm/page_alloc.c:3124!
      invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      RIP: 0010:__isolate_free_page+0x464/0x600
      RSP: 0000:ffff888b9e1af848 EFLAGS: 00010007
      RAX: 0000000030000000 RBX: ffff888c39fcf0f8 RCX: 0000000000000000
      RDX: 1ffff111873f9e25 RSI: 0000000000000004 RDI: ffffed1173c35ef6
      RBP: ffff888b9e1af898 R08: fffffbfff4fc2461 R09: fffffbfff4fc2460
      R10: fffffbfff4fc2460 R11: ffffffffa7e12303 R12: 0000000000000008
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000007
      FS:  0000000000000000(0000) GS:ffff888ba8e80000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fc7abc00000 CR3: 0000000752416004 CR4: 00000000001606a0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       compaction_alloc+0x886/0x25f0
       unmap_and_move+0x37/0x1e70
       migrate_pages+0x2ca/0xb20
       compact_zone+0x19cb/0x3620
       kcompactd_do_work+0x2df/0x680
       kcompactd+0x1d8/0x6c0
       kthread+0x32c/0x3f0
       ret_from_fork+0x35/0x40
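
      A simplified sketch of the corrected search loop
      (try_isolate_at_order() is a hypothetical stand-in for the isolation
      step):

        for (order = cc->search_order; order >= 0;
             order = next_search_order(cc, order)) {
                page = try_isolate_at_order(cc, order);
                if (!page)
                        break;  /* isolation failed: abort the search */
                /* ... use the isolated page ... */
        }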
      
      Link: http://lkml.kernel.org/r/20190320192648.52499-1-cai@lca.pw
      Fixes: dbe2d4e4 ("mm, compaction: round-robin the order while searching the free lists for a target")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    • mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints · 6b0868c8
      Committed by Mel Gorman
      Mikhail Gavrilov reported the following bug being triggered in a Fedora
      kernel based on 5.1-rc1, but it is relevant to a vanilla kernel.
      
       kernel: page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       kernel: ------------[ cut here ]------------
       kernel: kernel BUG at include/linux/mm.h:1021!
       kernel: invalid opcode: 0000 [#1] SMP NOPTI
       kernel: CPU: 6 PID: 116 Comm: kswapd0 Tainted: G         C        5.1.0-0.rc1.git1.3.fc31.x86_64 #1
       kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1201 12/07/2018
       kernel: RIP: 0010:__reset_isolation_pfn+0x244/0x2b0
       kernel: Code: fe 06 e8 0f 8e fc ff 44 0f b6 4c 24 04 48 85 c0 0f 85 dc fe ff ff e9 68 fe ff ff 48 c7 c6 58 b7 2e 8c 4c 89 ff e8 0c 75 00 00 <0f> 0b 48 c7 c6 58 b7 2e 8c e8 fe 74 00 00 0f 0b 48 89 fa 41 b8 01
       kernel: RSP: 0018:ffff9e2d03f0fde8 EFLAGS: 00010246
       kernel: RAX: 0000000000000034 RBX: 000000000081f380 RCX: ffff8cffbddd6c20
       kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8cffbddd6c20
       kernel: RBP: 0000000000000001 R08: 0000009898b94613 R09: 0000000000000000
       kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000100000
       kernel: R13: 0000000000100000 R14: 0000000000000001 R15: ffffca7de07ce000
       kernel: FS:  0000000000000000(0000) GS:ffff8cffbdc00000(0000) knlGS:0000000000000000
       kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       kernel: CR2: 00007fc1670e9000 CR3: 00000007f5276000 CR4: 00000000003406e0
       kernel: Call Trace:
       kernel:  __reset_isolation_suitable+0x62/0x120
       kernel:  reset_isolation_suitable+0x3b/0x40
       kernel:  kswapd+0x147/0x540
       kernel:  ? finish_wait+0x90/0x90
       kernel:  kthread+0x108/0x140
       kernel:  ? balance_pgdat+0x560/0x560
       kernel:  ? kthread_park+0x90/0x90
       kernel:  ret_from_fork+0x27/0x50
      
      He bisected it down to e332f741 ("mm, compaction: be selective about
      what pageblocks to clear skip hints").  The problem is that the patch in
      question was sloppy with respect to the handling of zone boundaries.  In
      some instances, it was possible for PFNs outside of a zone to be examined
      and if those were not properly initialised or poisoned then it would
      trigger the VM_BUG_ON.  This patch corrects the zone boundary issues when
      resetting pageblock skip hints and Mikhail reported that the bug did not
      trigger after 30 hours of testing.
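
      The fix amounts to validating candidate PFNs against the zone before
      touching their struct pages, along these lines (simplified):

        pfn = max(pfn, zone->zone_start_pfn);
        pfn = min(pfn, zone_end_pfn(zone) - 1);
        page = pfn_to_online_page(pfn);
        if (!page || page_zone(page) != zone)
                return false;   /* outside this zone: leave it alone */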
      
      Link: http://lkml.kernel.org/r/20190327085424.GL3189@techsingularity.net
      Fixes: e332f741 ("mm, compaction: be selective about what pageblocks to clear skip hints")
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
  6. March 30, 2019 (10 commits)
  7. March 16, 2019 (3 commits)
    • filemap: add a comment about FAULT_FLAG_RETRY_NOWAIT behavior · 8b0f9fa2
      Committed by Linus Torvalds
      I thought Josef Bacik's patch to drop the mmap_sem was buggy, because
      when looking at the error cases, there was one case where we returned
      VM_FAULT_RETRY without actually dropping the mmap_sem.
      
      Josef had to explain to me (using small words) that yes, that's actually
      what we're supposed to do, and his patch was correct.  Which not only
      convinced me he knew what he was doing and I should stop arguing with
      him, but also that I should add a comment to the case I was confused
      about.
      Patiently-pointed-out-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • filemap: drop the mmap_sem for all blocking operations · 6b4c9f44
      Committed by Josef Bacik
      Currently we only drop the mmap_sem if there is contention on the page
      lock.  The idea is that we issue readahead and then go to lock the page
      while it is under IO and we want to not hold the mmap_sem during the IO.
      
      The problem with this is the assumption that the readahead does anything.
      In the case that the box is under extreme memory or IO pressure we may end
      up not reading anything at all for readahead, which means we will end up
      reading in the page under the mmap_sem.
      
      Even if the readahead does something, it could get throttled because of io
      pressure on the system and the process is in a lower priority cgroup.
      
      Holding the mmap_sem while doing IO is problematic because it can cause
      system-wide priority inversions.  Consider some large company that does
      a lot of web traffic.  This large company has load balancing logic in
      its core web server, because some engineer thought this was a brilliant
      plan.  This load balancing logic gets statistics from /proc about the
      system, which trips over a process's mmap_sem for various reasons.  Now
      the web server application is in a protected cgroup, but these other
      processes may not be, and if they are being throttled while their
      mmap_sem is held we'll stall, and cause this nice death spiral.
      
      Instead, rework the filemap fault path to drop the mmap_sem at any point that
      we may do IO or block for an extended period of time.  This includes while
      issuing readahead, locking the page, or needing to call ->readpage because
      readahead did not occur.  Then once we have a fully uptodate page we can
      return with VM_FAULT_RETRY and come back again to find our nicely in-cache
      page that was gotten outside of the mmap_sem.
      
      This patch also adds a new helper for locking the page with the mmap_sem
      dropped.  This doesn't make sense currently as generally speaking if the
      page is already locked it'll have been read in (unless there was an error)
      before it was unlocked.  However a forthcoming patchset will change this
      with the ability to abort read-ahead bio's if necessary, making it more
      likely that we could contend for a page lock and still have a not uptodate
      page.  This allows us to deal with this case by grabbing the lock and
      issuing the IO without the mmap_sem held, and then returning
      VM_FAULT_RETRY to come back around.
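
      The pattern used throughout the fault path looks roughly like this
      sketch (fault_may_block() and do_io_without_mmap_sem() are
      hypothetical):

        if (fault_may_block(vmf)) {
                if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
                        up_read(&vmf->vma->vm_mm->mmap_sem);
                do_io_without_mmap_sem(vmf);
                return VM_FAULT_RETRY;  /* caller will retry the fault */
        }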
      
      [josef@toxicpanda.com: v6]
        Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
      [kirill@shutemov.name: fix race in filemap_fault()]
        Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • filemap: kill page_cache_read usage in filemap_fault · a75d4c33
      Committed by Josef Bacik
      Patch series "drop the mmap_sem when doing IO in the fault path", v6.
      
      Now that we have proper isolation in place with cgroups2 we have started
      going through and fixing the various priority inversions.  Most are all
      gone now, but this one is sort of weird since it's not necessarily a
      priority inversion that happens within the kernel, but rather because of
      something userspace does.
      
      We have giant applications that we want to protect, and parts of these
      giant applications do things like watch the system state to determine how
      healthy the box is for load balancing and such.  This involves running
      'ps' or other such utilities.  These utilities will often walk
      /proc/<pid>/whatever, and these files can sometimes need to
      down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
      we are stress testing that sometimes our protected application has latency
      spikes trying to get the mmap_sem for tasks that are in lower priority
      cgroups.
      
      This is because any down_write() on a semaphore essentially turns it into
      a mutex, so even if we currently have it held for reading, any new readers
      will not be allowed on to keep from starving the writer.  This is fine,
      except a lower priority task could be stuck doing IO because it has been
      throttled to the point that its IO is taking much longer than normal.  But
      because a higher priority group depends on this completing it is now stuck
      behind lower priority work.
      
      In order to avoid this particular priority inversion we want to use the
      existing retry mechanism to stop from holding the mmap_sem at all if we
      are going to do IO.  This already exists in the read case sort of, but
      needed to be extended for more than just grabbing the page lock.  With
      io.latency we throttle at submit_bio() time, so the readahead stuff can
      block and even page_cache_read can block, so all these paths need to have
      the mmap_sem dropped.
      
      The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
      because we have to reserve space for the dirty page, which can be a very
      expensive operation.  We use the same retry method as the read path, and
      simply cache the page and verify the page is still setup properly the next
      pass through ->page_mkwrite().
      
      I've tested these patches with xfstests and there are no regressions.
      
      This patch (of 3):
      
      If we do not have a page at filemap_fault time we'll do this weird
      forced page_cache_read thing to populate the page, and then drop it
      again and loop around and find it.  This makes for 2 ways we can read
      a page in filemap_fault, and it's not really needed.  Instead add an
      FGP_FOR_MMAP flag so that pagecache_get_page() will return an unlocked
      page that's in the pagecache.  Then use the normal page locking and
      readpage logic already in filemap_fault.  This simplifies the
      no-page-in-page-cache case significantly.
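
      The new call in filemap_fault() then looks roughly like:

        page = pagecache_get_page(mapping, offset,
                                  FGP_CREAT | FGP_FOR_MMAP, vmf->gfp_mask);
        if (!page)
                return vmf_error(-ENOMEM);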
      
      [akpm@linux-foundation.org: fix comment text]
      [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
        Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
      Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. March 15, 2019 (1 commit)
  9. March 13, 2019 (15 commits)
    • mm: memblock: update comments and kernel-doc · a2974133
      Committed by Mike Rapoport
      * Remove comments mentioning bootmem
      * Extend "DOC: memblock overview"
      * Add kernel-doc comments for several more functions
      
      [akpm@linux-foundation.org: fix copy-n-paste error]
      Link: http://lkml.kernel.org/r/1549626347-25461-1-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: split checks whether a region should be skipped to a helper function · c9a688a3
      Committed by Mike Rapoport
      __next_mem_range() and __next_mem_range_rev() duplicate the code that
      checks whether a region should be skipped because of node or flags
      incompatibility.
      
      Split this code into a helper function.
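
      A condensed sketch of the resulting helper (two of the checks shown;
      details illustrative):

        static bool should_skip_region(struct memblock_region *m,
                                       int nid, int flags)
        {
                int m_nid = memblock_get_region_node(m);

                /* only memory regions are associated with nodes */
                if (nid != NUMA_NO_NODE && nid != m_nid)
                        return true;
                /* if we want mirror memory, skip non-mirror regions */
                if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(m))
                        return true;
                return false;
        }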
      
      Link: http://lkml.kernel.org/r/1549455025-17706-3-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: remove memblock_{set,clear}_region_flags · fe145124
      Committed by Mike Rapoport
      The memblock API provides dedicated helpers to set or clear a flag on a
      memory region, e.g.  memblock_{mark,clear}_hotplug().
      
      The memblock_{set,clear}_region_flags() functions are used only by the
      memblock internal function that adjusts the region flags.  Drop these
      functions and use an open-coded implementation instead.
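
      The open-coded replacement inside memblock's internal flag-adjusting
      function is simply:

        if (set)
                r->flags |= flag;
        else
                r->flags &= ~flag;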
      
      Link: http://lkml.kernel.org/r/1549455025-17706-2-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: drop memblock_alloc_*_nopanic() variants · 26fb3dae
      Committed by Mike Rapoport
      As all the memblock allocation functions return NULL in case of error
      rather than panic(), the duplicates with _nopanic suffix can be removed.
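
      Former *_nopanic() callers keep their graceful-failure behavior by
      checking the result themselves, e.g. (sketch):

        ptr = memblock_alloc(size, SMP_CACHE_BYTES);
        if (!ptr)
                return -ENOMEM;   /* degrade gracefully, don't panic */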
      
      Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Petr Mladek <pmladek@suse.com>		[printk]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: memblock_alloc_try_nid: don't panic · c0dbe825
      Committed by Mike Rapoport
      As all the memblock_alloc*() users are now checking the return value and
      panic() in case of error, the panic() call can be removed from the core
      memblock allocator, namely memblock_alloc_try_nid().
      
      Link: http://lkml.kernel.org/r/1548057848-15136-21-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • treewide: add checks for the return value of memblock_alloc*() · 8a7f97b9
      Committed by Mike Rapoport
      Add checks for the return value of the memblock_alloc*() functions and
      call panic() in case of error.  The panic message repeats the one used
      by the panicking memblock allocators, with the parameters adjusted to
      include only the relevant ones.
      
      The replacement was mostly automated with semantic patches like the one
      below with manual massaging of format strings.
      
        @@
        expression ptr, size, align;
        @@
        ptr = memblock_alloc(size, align);
        + if (!ptr)
        + 	panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align);
      
      [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type]
        Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org
      [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc]
        Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com
      [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails]
        Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx
      [akpm@linux-foundation.org: fix xtensa printk warning]
      Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Reviewed-by: Guo Ren <ren_guo@c-sky.com>		[c-sky]
      Acked-by: Paul Burton <paul.burton@mips.com>		[MIPS]
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>	[s390]
      Reviewed-by: Juergen Gross <jgross@suse.com>		[Xen]
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Acked-by: Max Filippov <jcmvbkbc@gmail.com>		[xtensa]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/percpu: add checks for the return value of memblock_alloc*() · f655f405
      Committed by Mike Rapoport
      Add panic() calls if memblock_alloc() returns NULL.
      
      The panic() format duplicates the one used by memblock itself and, in
      order to avoid an explosion of long parameter lists, open-coded
      allocation size calculations are replaced with a local variable.
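
      The resulting call-site pattern is roughly the following (the size
      expression is illustrative):

        size_t size = ai->nr_groups * sizeof(ai->groups[0]);

        ptr = memblock_alloc(size, SMP_CACHE_BYTES);
        if (!ptr)
                panic("%s: Failed to allocate %zu bytes\n", __func__, size);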
      
      Link: http://lkml.kernel.org/r/1548057848-15136-17-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: make memblock_find_in_range_node() and choose_memblock_flags() static · c366ea89
      Committed by Mike Rapoport
      These functions are not used outside memblock.  Make them static.
      
      Link: http://lkml.kernel.org/r/1548057848-15136-12-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: refactor internal allocation functions · 92d12f95
      Committed by Mike Rapoport
      Currently, memblock has several internal functions with overlapping
      functionality.  They all call memblock_find_in_range_node() to find free
      memory and then reserve the allocated range and mark it with kmemleak.
      However, there are differences in the allocation constraints and in the
      fallback strategies.
      
      The allocations returning physical address first attempt to find free
      memory on the specified node within mirrored memory regions, then retry
      on the same node without the requirement for memory mirroring and
      finally fall back to all available memory.
      
      The allocations returning virtual address start with clamping the
      allowed range to memblock.current_limit, attempt to allocate from the
      specified node from regions with mirroring and with user defined minimal
      address.  If such allocation fails, next attempt is done with node
      restriction lifted.  Next, the allocation is retried with minimal
      address reset to zero and at last without the requirement for mirrored
      regions.
      
      Let's consolidate the various fallback handling and make it more
      consistent for the physical and virtual variants.  Most of the fallback
      handling is moved to memblock_alloc_range_nid(), which now handles node
      and mirror fallbacks.
      
      The memblock_alloc_internal() uses memblock_alloc_range_nid() to get a
      physical address of the allocated range and converts it to virtual
      address.
      
      The fallback for allocation below the specified minimal address remains
      in memblock_alloc_internal() because memblock_alloc_range_nid() is used
      by CMA with exact requirement for lower bounds.
      
      The memblock_phys_alloc_nid() function is completely dropped as it is not
      used anywhere outside memblock and its only usage can be replaced by a
      call to memblock_alloc_range_nid().
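
      A condensed sketch of the consolidated fallback order in
      memblock_alloc_range_nid() (simplified; the real code restarts with
      the original node when it drops the mirroring requirement):

        again:
                found = memblock_find_in_range_node(size, align, start, end,
                                                    nid, flags);
                if (found && !memblock_reserve(found, size))
                        return found;
                if (nid != NUMA_NO_NODE) {
                        nid = NUMA_NO_NODE;             /* retry on any node */
                        goto again;
                }
                if (flags & MEMBLOCK_MIRROR) {
                        flags &= ~MEMBLOCK_MIRROR;      /* drop mirroring */
                        goto again;
                }
                return 0;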
      
      [rppt@linux.ibm.com: fix parameter order in memblock_phys_alloc_try_nid()]
        Link: http://lkml.kernel.org/r/20190203113915.GC8620@rapoport-lnx
      Link: http://lkml.kernel.org/r/1548057848-15136-11-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: drop memblock_alloc_base() · 0ba9e6ed
      Committed by Mike Rapoport
      The memblock_alloc_base() function tries to allocate memory up to the
      limit specified by its max_addr parameter and panics if the allocation
      fails.  Replace its usage with memblock_phys_alloc_range() and make the
      callers check the return value and panic in case of error.
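
      The call-site pattern becomes, roughly:

        addr = memblock_phys_alloc_range(size, align, 0, max_addr);
        if (!addr)
                panic("Failed to allocate %pa bytes below %pa\n",
                      &size, &max_addr);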
      
      Link: http://lkml.kernel.org/r/1548057848-15136-10-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: drop __memblock_alloc_base() · 42b46aef
      Committed by Mike Rapoport
      The __memblock_alloc_base() function tries to allocate memory up to
      the limit specified by its max_addr parameter.  Depending on the value
      of this parameter, __memblock_alloc_base() can be replaced with the
      appropriate memblock_phys_alloc*() variant.
      
      Link: http://lkml.kernel.org/r/1548057848-15136-9-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Rob Herring <robh@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: memblock_phys_alloc(): don't panic · ecc3e771
      Committed by Mike Rapoport
      Make the memblock_phys_alloc() function an inline wrapper for
      memblock_phys_alloc_range() and update the memblock_phys_alloc() callers
      to check the returned value and panic in case of error.
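
      The wrapper itself is a one-liner along these lines:

        static inline phys_addr_t memblock_phys_alloc(phys_addr_t size,
                                                      phys_addr_t align)
        {
                return memblock_phys_alloc_range(size, align, 0,
                                                 MEMBLOCK_ALLOC_ACCESSIBLE);
        }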
      
      Link: http://lkml.kernel.org/r/1548057848-15136-8-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: memblock_phys_alloc_try_nid(): don't panic · 33755574
      Committed by Mike Rapoport
      The memblock_phys_alloc_try_nid() function tries to allocate memory from
      the requested node and then falls back to allocation from any node in
      the system.  The memblock_alloc_base() fallback used by this function
      panics if the allocation fails.
      
      Replace the memblock_alloc_base() fallback with the direct call to
      memblock_alloc_range_nid() and update the memblock_phys_alloc_try_nid()
      callers to check the returned value and panic in case of error.
      
      Link: http://lkml.kernel.org/r/1548057848-15136-7-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: emphasize that memblock_alloc_range() returns a physical address · 8a770c2a
      Committed by Mike Rapoport
      Rename memblock_alloc_range() to memblock_phys_alloc_range() to
      emphasize that it returns a physical address.
      
      While at it, remove the 'enum memblock_flags' parameter from this
      function, as its only user sets it to MEMBLOCK_NONE anyway, which is
      the default for most memblock allocations.
      
      Link: http://lkml.kernel.org/r/1548057848-15136-6-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: drop memblock_alloc_base_nid() · 53d818d2
      Committed by Mike Rapoport
      memblock_alloc_base_nid() is a one-line wrapper for
      memblock_alloc_range_nid() without any side effects.
      
      Replace its usage with direct calls to memblock_alloc_range_nid().
      
      Link: http://lkml.kernel.org/r/1548057848-15136-5-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>