1. 29 3月, 2018 3 次提交
    • M
      mm/page_owner: fix recursion bug after changing skip entries · 299815a4
      Maninder Singh 提交于
      This patch fixes commit 5f48f0bd ("mm, page_owner: skip unnecessary
      stack_trace entries").
      
      Because if we skip first two entries then logic of checking count value
      as 2 for recursion is broken and code will go in one depth recursion.
      
      so we need to check only one call of _RET_IP(__set_page_owner) while
      checking for recursion.
      
      Current Backtrace while checking for recursion:-
      
        (save_stack)             from (__set_page_owner)  // (But recursion returns true here)
        (__set_page_owner)       from (get_page_from_freelist)
        (get_page_from_freelist) from (__alloc_pages_nodemask)
        (__alloc_pages_nodemask) from (depot_save_stack)
        (depot_save_stack)       from (save_stack)       // recursion should return true here
        (save_stack)             from (__set_page_owner)
        (__set_page_owner)       from (get_page_from_freelist)
        (get_page_from_freelist) from (__alloc_pages_nodemask+)
        (__alloc_pages_nodemask) from (depot_save_stack)
        (depot_save_stack)       from (save_stack)
        (save_stack)             from (__set_page_owner)
        (__set_page_owner)       from (get_page_from_freelist)
      
      Correct Backtrace with fix:
      
        (save_stack)             from (__set_page_owner) // recursion returned true here
        (__set_page_owner)       from (get_page_from_freelist)
        (get_page_from_freelist) from (__alloc_pages_nodemask+)
        (__alloc_pages_nodemask) from (depot_save_stack)
        (depot_save_stack)       from (save_stack)
        (save_stack)             from (__set_page_owner)
        (__set_page_owner)       from (get_page_from_freelist)
      
      Link: http://lkml.kernel.org/r/1521607043-34670-1-git-send-email-maninder1.s@samsung.com
      Fixes: 5f48f0bd ("mm, page_owner: skip unnecessary stack_trace entries")
      Signed-off-by: NManinder Singh <maninder1.s@samsung.com>
      Signed-off-by: NVaneet Narang <v.narang@samsung.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@techadventures.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ayush Mittal <ayush.m@samsung.com>
      Cc: Prakash Gupta <guptap@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Vasyl Gomonovych <gomonovych@gmail.com>
      Cc: Amit Sahrawat <a.sahrawat@samsung.com>
      Cc: <pankaj.m@samsung.com>
      Cc: Vaneet Narang <v.narang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      299815a4
    • M
      ipc/shm.c: add split function to shm_vm_ops · 3d942ee0
      Mike Kravetz 提交于
      If System V shmget/shmat operations are used to create a hugetlbfs
      backed mapping, it is possible to munmap part of the mapping and split
      the underlying vma such that it is not huge page aligned.  This will
      untimately result in the following BUG:
      
        kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE SMP NR_CPUS=2048 NUMA PowerNV
        Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
        CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G         C  E 4.15.0-10-generic #11-Ubuntu
        NIP:  c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
        REGS: c000003fbcdcf810 TRAP: 0700   Tainted: G         C  E (4.15.0-10-generic)
        MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 20040000
        CFAR: c00000000036ee44 SOFTE: 1
        NIP __unmap_hugepage_range+0xa4/0x760
        LR __unmap_hugepage_range_final+0x28/0x50
        Call Trace:
          0x7115e4e00000 (unreliable)
          __unmap_hugepage_range_final+0x28/0x50
          unmap_single_vma+0x11c/0x190
          unmap_vmas+0x94/0x140
          exit_mmap+0x9c/0x1d0
          mmput+0xa8/0x1d0
          do_exit+0x360/0xc80
          do_group_exit+0x60/0x100
          SyS_exit_group+0x24/0x30
          system_call+0x58/0x6c
        ---[ end trace ee88f958a1c62605 ]---
      
      This bug was introduced by commit 31383c68 ("mm, hugetlbfs:
      introduce ->split() to vm_operations_struct").  A split function was
      added to vm_operations_struct to determine if a mapping can be split.
      This was mostly for device-dax and hugetlbfs mappings which have
      specific alignment constraints.
      
      Mappings initiated via shmget/shmat have their original vm_ops
      overwritten with shm_vm_ops.  shm_vm_ops functions will call back to the
      original vm_ops if needed.  Add such a split function to shm_vm_ops.
      
      Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com
      Fixes: 31383c68 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Tested-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d942ee0
    • S
      mm, slab: memcg_link the SLAB's kmem_cache · 880cd276
      Shakeel Butt 提交于
      All the root caches are linked into slab_root_caches which was
      introduced by the commit 510ded33 ("slab: implement slab_root_caches
      list") but it missed to add the SLAB's kmem_cache.
      
      While experimenting with opt-in/opt-out kmem accounting, I noticed
      system crashes due to NULL dereference inside cache_from_memcg_idx()
      while deferencing kmem_cache.memcg_params.memcg_caches.  The upstream
      clean kernel will not see these crashes but SLAB should be consistent
      with SLUB which does linked its boot caches (kmem_cache_node and
      kmem_cache) into slab_root_caches.
      
      Link: http://lkml.kernel.org/r/20180319210020.60289-1-shakeelb@google.com
      Fixes: 510ded33 ("slab: implement slab_root_caches list")
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      880cd276
  2. 26 3月, 2018 8 次提交
  3. 25 3月, 2018 3 次提交
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace · e43d40b3
      Linus Torvalds 提交于
      Pull mqueuefs revert from Eric Biederman:
       "This fixes a regression that came in the merge window for v4.16.
      
        The problem is that the permissions for mounting and using the
        mqueuefs filesystem are broken. The necessary permission check is
        missing letting people who should not be able to mount mqueuefs mount
        mqueuefs. The field sb->s_user_ns is set incorrectly not allowing the
        mounter of mqueuefs to remount and otherwise have proper control over
        the filesystem.
      
        Al Viro and I see the path to the necessary fixes differently and I am
        not even certain at this point he actually sees all of the necessary
        fixes. Given a couple weeks we can probably work something out but I
        don't see the review being resolved in time for the final v4.16. I
        don't want v4.16 shipping with a nasty regression. So unfortunately I
        am sending a revert"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
        Revert "mqueue: switch to on-demand creation of internal mount"
      e43d40b3
    • E
      Revert "mqueue: switch to on-demand creation of internal mount" · cfb2f6f6
      Eric W. Biederman 提交于
      This reverts commit 36735a6a.
      
      Aleksa Sarai <asarai@suse.de> writes:
      > [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
      >
      > Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
      > which was introduced by 36735a6a ("mqueue: switch to on-demand
      > creation of internal mount").
      >
      > Basically, the reproducer boils down to being able to mount mqueue if
      > you create a new user namespace, even if you don't unshare the IPC
      > namespace.
      >
      > Previously this was not possible, and you would get an -EPERM. The mount
      > is the *host* mqueue mount, which is being cached and just returned from
      > mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
      > it was intentional -- since I'm not familiar with mqueue).
      >
      > To me it looks like there is a missing permission check. I've included a
      > patch below that I've compile-tested, and should block the above case.
      > Can someone please tell me if I'm missing something? Is this actually
      > safe?
      >
      > [1]: https://github.com/docker/docker/issues/36674
      
      The issue is a lot deeper than a missing permission check.  sb->s_user_ns
      was is improperly set as well.  So in addition to the filesystem being
      mounted when it should not be mounted, so things are not allow that should
      be.
      
      We are practically to the release of 4.16 and there is no agreement between
      Al Viro and myself on what the code should looks like to fix things properly.
      So revert the code to what it was before so that we can take our time
      and discuss this properly.
      
      Fixes: 36735a6a ("mqueue: switch to on-demand creation of internal mount")
      Reported-by: NFelix Abecassis <fabecassis@nvidia.com>
      Reported-by: NAleksa Sarai <asarai@suse.de>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      cfb2f6f6
    • L
      Merge tag 'pinctrl-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · bcfc1f45
      Linus Torvalds 提交于
      Pull pin control fixes from Linus Walleij:
       "Two fixes for pin control for v4.16:
      
         - Renesas SH-PFC: remove a duplicate clkout pin which was causing
           crashes
      
         - fix Samsung out of bounds exceptions"
      
      * tag 'pinctrl-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: samsung: Validate alias coming from DT
        pinctrl: sh-pfc: r8a7795: remove duplicate of CLKOUT pin in pinmux_pins[]
      bcfc1f45
  4. 24 3月, 2018 14 次提交
  5. 23 3月, 2018 12 次提交
    • L
      Merge branch 'akpm' (patches from Andrew) · f36b7534
      Linus Torvalds 提交于
      Merge misc fixes from Andrew Morton:
       "13 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm, thp: do not cause memcg oom for thp
        mm/vmscan: wake up flushers for legacy cgroups too
        Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
        mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
        mm/thp: do not wait for lock_page() in deferred_split_scan()
        mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
        x86/mm: implement free pmd/pte page interfaces
        mm/vmalloc: add interfaces to free unmapped page table
        h8300: remove extraneous __BIG_ENDIAN definition
        hugetlbfs: check for pgoff value overflow
        lockdep: fix fs_reclaim warning
        MAINTAINERS: update Mark Fasheh's e-mail
        mm/mempolicy.c: avoid use uninitialized preferred_node
      f36b7534
    • L
      Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 8401c72c
      Linus Torvalds 提交于
      Pull libnvdimm fixes from Dan Williams:
       "Two regression fixes, two bug fixes for older issues, two fixes for
        new functionality added this cycle that have userspace ABI concerns,
        and a small cleanup. These have appeared in a linux-next release and
        have a build success report from the 0day robot.
      
         * The 4.16 rework of altmap handling led to some configurations
           leaking page table allocations due to freeing from the altmap
           reservation rather than the page allocator.
      
           The impact without the fix is leaked memory and a WARN() message
           when tearing down libnvdimm namespaces. The rework also missed a
           place where error handling code needed to be removed that can lead
           to a crash if devm_memremap_pages() fails.
      
         * acpi_map_pxm_to_node() had a latent bug whereby it could
           misidentify the closest online node to a given proximity domain.
      
         * Block integrity handling was reworked several kernels back to allow
           calling add_disk() after setting up the integrity profile.
      
           The nd_btt and nd_blk drivers are just now catching up to fix
           automatic partition detection at driver load time.
      
         * The new peristence_domain attribute, a platform indicator of
           whether cpu caches are powerfail protected for example, is meant to
           be a single value enum and not a set of flags.
      
           This oversight was caught while reviewing new userspace code in
           libndctl to communicate the attribute.
      
           Fix this new enabling up so that we are not stuck with an unwanted
           userspace ABI"
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        libnvdimm, nfit: fix persistence domain reporting
        libnvdimm, region: hide persistence_domain when unknown
        acpi, numa: fix pxm to online numa node associations
        x86, memremap: fix altmap accounting at free
        libnvdimm: remove redundant assignment to pointer 'dev'
        libnvdimm, {btt, blk}: do integrity setup before add_disk()
        kernel/memremap: Remove stale devres_free() call
      8401c72c
    • L
      Merge tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linux · 9ec7ccc8
      Linus Torvalds 提交于
      Pull drm fixes from Dave Airlie:
       "A bunch of fixes all over the place (core, i915, amdgpu, imx, sun4i,
        ast, tegra, vmwgfx), nothing too serious or worrying at this stage.
      
         - one uapi fix to stop multi-planar images with getfb
      
         - Sun4i error path and clock fixes
      
         - udl driver mmap offset fix
      
         - i915 DP MST and GPU reset fixes
      
         - vmwgfx mutex and black screen fixes
      
         - imx array underflow fix and vblank fix
      
         - amdgpu: display fixes
      
         - exynos devicetree fix
      
         - ast mode fix"
      
      * tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linux: (29 commits)
        drm/ast: Fixed 1280x800 Display Issue
        drm: udl: Properly check framebuffer mmap offsets
        drm/i915: Specify which engines to reset following semaphore/event lockups
        drm/vmwgfx: Fix a destoy-while-held mutex problem.
        drm/vmwgfx: Fix black screen and device errors when running without fbdev
        drm: Reject getfb for multi-plane framebuffers
        drm/amd/display: Add one to EDID's audio channel count when passing to DC
        drm/amd/display: We shouldn't set format_default on plane as atomic driver
        drm/amd/display: Fix FMT truncation programming
        drm/amd/display: Allow truncation to 10 bits
        drm/sun4i: hdmi: Fix another error handling path in 'sun4i_hdmi_bind()'
        drm/sun4i: hdmi: Fix an error handling path in 'sun4i_hdmi_bind()'
        drm/i915/dp: Write to SET_POWER dpcd to enable MST hub.
        drm/amd/display: fix dereferencing possible ERR_PTR()
        drm/amd/display: Refine disable VGA
        drm/tegra: Shutdown on driver unbind
        drm/tegra: dsi: Don't disable regulator on ->exit()
        drm/tegra: dc: Detach IOMMU group from domain only once
        dt-bindings: exynos: Document #sound-dai-cells property of the HDMI node
        drm/imx: move arming of the vblank event to atomic_flush
        ...
      9ec7ccc8
    • D
      mm, thp: do not cause memcg oom for thp · 9d3c3354
      David Rientjes 提交于
      Commit 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and
      madvised allocations") changed the page allocator to no longer detect
      thp allocations based on __GFP_NORETRY.
      
      It did not, however, modify the mem cgroup try_charge() path to avoid
      oom kill for either khugepaged collapsing or thp faulting.  It is never
      expected to oom kill a process to allocate a hugepage for thp; reclaim
      is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
      (and charging) should fallback instead of oom killing processes.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
      Fixes: 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d3c3354
    • A
      mm/vmscan: wake up flushers for legacy cgroups too · 1c610d5f
      Andrey Ryabinin 提交于
      Commit 726d061f ("mm: vmscan: kick flushers when we encounter dirty
      pages on the LRU") added flusher invocation to shrink_inactive_list()
      when many dirty pages on the LRU are encountered.
      
      However, shrink_inactive_list() doesn't wake up flushers for legacy
      cgroup reclaim, so the next commit bbef9384 ("mm: vmscan: remove old
      flusher wakeup from direct reclaim path") removed the only source of
      flusher's wake up in legacy mem cgroup reclaim path.
      
      This leads to premature OOM if there is too many dirty pages in cgroup:
          # mkdir /sys/fs/cgroup/memory/test
          # echo $$ > /sys/fs/cgroup/memory/test/tasks
          # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          # dd if=/dev/zero of=tmp_file bs=1M count=100
          Killed
      
          dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
      
          Call Trace:
           dump_stack+0x46/0x65
           dump_header+0x6b/0x2ac
           oom_kill_process+0x21c/0x4a0
           out_of_memory+0x2a5/0x4b0
           mem_cgroup_out_of_memory+0x3b/0x60
           mem_cgroup_oom_synchronize+0x2ed/0x330
           pagefault_out_of_memory+0x24/0x54
           __do_page_fault+0x521/0x540
           page_fault+0x45/0x50
      
          Task in /test killed as a result of limit of /test
          memory: usage 51200kB, limit 51200kB, failcnt 73
          memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
          kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
          Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
                  mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
      	    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
          Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
          Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
          oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Wake up flushers in legacy cgroup reclaim too.
      
      Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
      Fixes: bbef9384 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c610d5f
    • D
      Revert "mm: page_alloc: skip over regions of invalid pfns where possible" · f59f1caf
      Daniel Vacek 提交于
      This reverts commit b92df1de ("mm: page_alloc: skip over regions of
      invalid pfns where possible").  The commit is meant to be a boot init
      speed up skipping the loop in memmap_init_zone() for invalid pfns.
      
      But given some specific memory mapping on x86_64 (or more generally
      theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the
      implementation also skips valid pfns which is plain wrong and causes
      'kernel BUG at mm/page_alloc.c:1389!'
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010: move_freepages+0x15e/0x160
        --
        Call Trace:
          move_freepages_block+0x73/0x80
          __rmqueue+0x263/0x460
          get_page_from_freelist+0x7e1/0x9e0
          __alloc_pages_nodemask+0x176/0x420
        --
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff       System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff       System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff       System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff       System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff       System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200>       505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0       <<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: NDaniel Vacek <neelx@redhat.com>
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f59f1caf
    • K
      mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink() · b3cd54b2
      Kirill A. Shutemov 提交于
      shmem_unused_huge_shrink() gets called from reclaim path.  Waiting for
      page lock may lead to deadlock there.
      
      There was a bug report that may be attributed to this:
      
        http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net
      
      Replace lock_page() with trylock_page() and skip the page if we failed
      to lock it.  We will get to the page on the next scan.
      
      We can test for the PageTransHuge() outside the page lock as we only
      need protection against splitting the page under us.  Holding pin oni
      the page is enough for this.
      
      Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
      Fixes: 779750d2 ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NEric Wheeler <linux-mm@lists.ewheeler.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3cd54b2
    • K
      mm/thp: do not wait for lock_page() in deferred_split_scan() · fa41b900
      Kirill A. Shutemov 提交于
      deferred_split_scan() gets called from reclaim path.  Waiting for page
      lock may lead to deadlock there.
      
      Replace lock_page() with trylock_page() and skip the page if we failed
      to lock it.  We will get to the page on the next scan.
      
      Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
      Fixes: 9a982250 ("thp: introduce deferred_split_huge_page()")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa41b900
    • K
      mm/khugepaged.c: convert VM_BUG_ON() to collapse fail · fece2029
      Kirill A. Shutemov 提交于
      khugepaged is not yet able to convert PTE-mapped huge pages back to PMD
      mapped.  We do not collapse such pages.  See check
      khugepaged_scan_pmd().
      
      But if between khugepaged_scan_pmd() and __collapse_huge_page_isolate()
      somebody managed to instantiate THP in the range and then split the PMD
      back to PTEs we would have a problem --
      VM_BUG_ON_PAGE(PageCompound(page)) will get triggered.
      
      It's possible since we drop mmap_sem during collapse to re-take for
      write.
      
      Replace the VM_BUG_ON() with graceful collapse fail.
      
      Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
      Fixes: b1caa957 ("khugepaged: ignore pmd tables with THP mapped with ptes")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fece2029
    • T
      x86/mm: implement free pmd/pte page interfaces · 28ee90fe
      Toshi Kani 提交于
      Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
      clear a given pud/pmd entry and free up lower level page table(s).
      
      The address range associated with the pud/pmd entry must have been
      purged by INVLPG.
      
      Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reported-by: NLei Li <lious.lilei@hisilicon.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28ee90fe
    • T
      mm/vmalloc: add interfaces to free unmapped page table · b6bdb751
      Toshi Kani 提交于
      On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
      create pud/pmd mappings.  A kernel panic was observed on arm64 systems
      with Cortex-A75 in the following steps as described by Hanjun Guo.
      
       1. ioremap a 4K size, valid page table will build,
       2. iounmap it, pte0 will set to 0;
       3. ioremap the same address with 2M size, pgd/pmd is unchanged,
          then set the a new value for pmd;
       4. pte0 is leaked;
       5. CPU may meet exception because the old pmd is still in TLB,
          which will lead to kernel panic.
      
      This panic is not reproducible on x86.  INVLPG, called from iounmap,
      purges all levels of entries associated with purged address on x86.  x86
      still has memory leak.
      
      The patch changes the ioremap path to free unmapped page table(s) since
      doing so in the unmap path has the following issues:
      
       - The iounmap() path is shared with vunmap(). Since vmap() only
         supports pte mappings, making vunmap() to free a pte page is an
         overhead for regular vmap users as they do not need a pte page freed
         up.
      
       - Checking if all entries in a pte page are cleared in the unmap path
         is racy, and serializing this check is expensive.
      
       - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
         Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
         purge.
      
      Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
      clear a given pud/pmd entry and free up a page for the lower level
      entries.
      
      This patch implements their stub functions on x86 and arm64, which work
      as workaround.
      
      [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
      Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Reported-by: NLei Li <lious.lilei@hisilicon.com>
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6bdb751
    • A
      h8300: remove extraneous __BIG_ENDIAN definition · 1705f7c5
      Arnd Bergmann 提交于
      A bugfix I did earlier caused a build regression on h8300, which defines
      the __BIG_ENDIAN macro in a slightly different way than the generic
      code:
      
        arch/h8300/include/asm/byteorder.h:5:0: warning: "__BIG_ENDIAN" redefined
      
      We don't need to define it here, as the same macro is already provided
      by the linux/byteorder/big_endian.h, and that version does not conflict.
      
      While this is a v4.16 regression, my earlier patch also got backported
      to the 4.14 and 4.15 stable kernels, so we need the fixup there as well.
      
      Link: http://lkml.kernel.org/r/20180313120752.2645129-1-arnd@arndb.de
      Fixes: 101110f6 ("Kbuild: always define endianess in kconfig.h")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1705f7c5