1. 01 December 2017, 2 commits
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 9e0600f5
      Committed by Linus Torvalds
      Pull KVM fixes from Paolo Bonzini:
      
       - x86 bugfixes: APIC, nested virtualization, IOAPIC
      
       - PPC bugfix: HPT guests on a POWER9 radix host
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
        KVM: Let KVM_SET_SIGNAL_MASK work as advertised
        KVM: VMX: Fix vmx->nested freeing when no SMI handler
        KVM: VMX: Fix rflags cache during vCPU reset
        KVM: X86: Fix softlockup when get the current kvmclock
        KVM: lapic: Fixup LDR on load in x2apic
        KVM: lapic: Split out x2apic ldr calculation
        KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts
        KVM: vmx: use X86_CR4_UMIP and X86_FEATURE_UMIP
        KVM: x86: Fix CPUID function for word 6 (80000001_ECX)
        KVM: nVMX: Fix vmx_check_nested_events() return value in case an event was reinjected to L2
        KVM: x86: ioapic: Preserve read-only values in the redirection table
        KVM: x86: ioapic: Clear Remote IRR when entry is switched to edge-triggered
        KVM: x86: ioapic: Remove redundant check for Remote IRR in ioapic_set_irq
        KVM: x86: ioapic: Don't fire level irq when Remote IRR set
        KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race
        KVM: x86: inject exceptions produced by x86_decode_insn
        KVM: x86: Allow suppressing prints on RDMSR/WRMSR of unhandled MSRs
        KVM: x86: fix em_fxstor() sleeping while in atomic
        KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
        KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry
        ...
      9e0600f5
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 22985bf5
      Committed by Linus Torvalds
      Pull s390 fixes from Martin Schwidefsky:
      
       - SPDX identifiers are added to more of the s390 specific files.
      
       - The ELF_ET_DYN_BASE base patch from Kees is reverted, since with the
         change some old 31-bit programs crash.
      
       - Bug fixes and cleanups.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (29 commits)
        s390/gs: add compat regset for the guarded storage broadcast control block
        s390: revert ELF_ET_DYN_BASE base changes
        s390: Remove redundant license text
        s390: crypto: Remove redundant license text
        s390: include: Remove redundant license text
        s390: kernel: Remove redundant license text
        s390: add SPDX identifiers to the remaining files
        s390: appldata: add SPDX identifiers to the remaining files
        s390: pci: add SPDX identifiers to the remaining files
        s390: mm: add SPDX identifiers to the remaining files
        s390: crypto: add SPDX identifiers to the remaining files
        s390: kernel: add SPDX identifiers to the remaining files
        s390: sthyi: add SPDX identifiers to the remaining files
        s390: drivers: Remove redundant license text
        s390: crypto: Remove redundant license text
        s390: virtio: add SPDX identifiers to the remaining files
        s390: scsi: zfcp_aux: add SPDX identifier
        s390: net: add SPDX identifiers to the remaining files
        s390: char: add SPDX identifiers to the remaining files
        s390: cio: add SPDX identifiers to the remaining files
        ...
      22985bf5
  2. 30 November 2017, 38 commits
    • Merge branch 'akpm' (patches from Andrew) · a0908a1b
      Committed by Linus Torvalds
      Merge misc fixes from Andrew Morton:
       "28 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (28 commits)
        fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate()
        mm/hugetlb: fix NULL-pointer dereference on 5-level paging machine
        autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored"
        autofs: revert "autofs: take more care to not update last_used on path walk"
        fs/fat/inode.c: fix sb_rdonly() change
        mm, memcg: fix mem_cgroup_swapout() for THPs
        mm: migrate: fix an incorrect call of prep_transhuge_page()
        kmemleak: add scheduling point to kmemleak_scan()
        scripts/bloat-o-meter: don't fail with division by 0
        fs/mbcache.c: make count_objects() more robust
        Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical"
        mm/madvise.c: fix madvise() infinite loop under special circumstances
        exec: avoid RLIMIT_STACK races with prlimit()
        IB/core: disable memory registration of filesystem-dax vmas
        v4l2: disable filesystem-dax mapping support
        mm: fail get_vaddr_frames() for filesystem-dax mappings
        mm: introduce get_user_pages_longterm
        device-dax: implement ->split() to catch invalid munmap attempts
        mm, hugetlbfs: introduce ->split() to vm_operations_struct
        scripts/faddr2line: extend usage on generic arch
        ...
      a0908a1b
    • fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate() · 72639e6d
      Committed by Nadav Amit
      hugetlbfs_fallocate() currently performs put_page() before unlock_page().
      This opens a small time window, from the time the page is added to the
      page cache until it is unlocked, in which the page might be removed from
      the page cache by another core.  If the page is removed during this time
      window, it might cause memory corruption, as the wrong page will be
      unlocked.
      
      It is arguable whether this scenario can happen in a real system, and
      there are several mitigating factors.  The issue was found by code
      inspection (actually grep), and not by actually triggering the flow.
      Yet, since putting the page before unlocking is incorrect it should be
      fixed, if only to prevent future breakage or someone copy-pasting this
      code.
      
      Mike said:
       "I am of the opinion that this does not need to be sent to stable.
        Although the ordering in the current code is incorrect, there is no way
        for this to be a problem with current locking. In addition, I verified
        that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
        for hugetlbfs and other filesystems is addressed in 3a77d214 ("mm:
        fadvise: avoid fadvise for fs without backing device")"
      
      Link: http://lkml.kernel.org/r/20170826191124.51642-1-namit@vmware.com
      Fixes: 70c3547e ("hugetlbfs: add hugetlbfs_fallocate()")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72639e6d
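
      A minimal kernel-style sketch of the ordering the fix above enforces
      (hypothetical helper name; the real code lives in hugetlbfs_fallocate()):

          /*
           * Illustrative only: unlock the page before dropping the reference,
           * so the page being unlocked is guaranteed to still be the page
           * this code just added to the page cache.
           */
          static void finish_new_page(struct page *page)
          {
                  unlock_page(page);      /* 1) release the page lock first  */
                  put_page(page);         /* 2) then drop our last reference */
          }
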
    • mm/hugetlb: fix NULL-pointer dereference on 5-level paging machine · f4f0a3d8
      Committed by Kirill A. Shutemov
      I made a mistake while converting the hugetlb code to 5-level paging: in
      huge_pte_alloc() we have to use p4d_alloc(), not p4d_offset().
      
      Otherwise it leads to crash -- NULL-pointer dereference in pud_alloc()
      if p4d table is not yet allocated.
      
      It can only happen in 5-level paging mode.  In 4-level paging mode
      p4d_offset() always returns pgd, so we are fine.
      
      Link: http://lkml.kernel.org/r/20171122121921.64822-1-kirill.shutemov@linux.intel.com
      Fixes: c2febafc ("mm: convert generic code to 5-level paging")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>	[4.11+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4f0a3d8
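
      A simplified sketch of the distinction described above (not the full
      mm/hugetlb.c logic): p4d_offset() only walks an existing table, while
      p4d_alloc() creates the p4d level on demand.

          pte_t *huge_pte_alloc_sketch(struct mm_struct *mm, unsigned long addr)
          {
                  pgd_t *pgd = pgd_offset(mm, addr);
                  /* was p4d_offset(pgd, addr): NULL-derefs in pud_alloc() when
                   * the p4d table does not exist yet (5-level paging only) */
                  p4d_t *p4d = p4d_alloc(mm, pgd, addr);

                  if (!p4d)
                          return NULL;
                  return (pte_t *)pud_alloc(mm, p4d, addr);
          }
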
    • autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored" · 5d38f049
      Committed by Ian Kent
      Commit 42f46148 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
      allowed the fstatat(2) system call to properly honor the AT_NO_AUTOMOUNT
      flag but introduced a semantic change.
      
      In order to honor AT_NO_AUTOMOUNT a semantic change was made to the
      negative dentry case for stat family system calls in follow_automount().
      
      This changed the behavior so that the automount is no longer triggered
      unconditionally in this case; an error is returned instead.
      
      This has caused more problems than I expected so reverting the change is
      needed.
      
      In a discussion with Neil Brown it was concluded that the automount(8)
      daemon can implement this change without kernel modifications.  So that
      will be done instead and the autofs module documentation updated with a
      description of the problem and what needs to be done by module users for
      this specific case.
      
      Link: http://lkml.kernel.org/r/151174730120.6162.3848002191530283984.stgit@pluto.themaw.net
      Fixes: 42f46148 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
      Signed-off-by: Ian Kent <raven@themaw.net>
      Cc: Neil Brown <neilb@suse.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Colin Walters <walters@redhat.com>
      Cc: Ondrej Holy <oholy@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.11+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d38f049
    • autofs: revert "autofs: take more care to not update last_used on path walk" · 43694d4b
      Committed by Ian Kent
      While commit 092a5345 ("autofs: take more care to not update
      last_used on path walk") helped (partially) resolve a problem where
      automounts were not expiring due to aggressive accesses from user space,
      it has a side effect for very large environments.
      
      This change helps with the expire problem by making the expire more
      aggressive but, for very large environments, that means more mount
      requests from clients.  When there are a lot of clients that can mean
      fairly significant server load increases.
      
      It turns out I put the last_used in this position to solve this very
      problem and failed to update my own thinking of the autofs expire
      policy.  So the patch being reverted introduces a regression which
      should be fixed.
      
      Link: http://lkml.kernel.org/r/151174729420.6162.1832622523537052460.stgit@pluto.themaw.net
      Fixes: 092a5345 ("autofs: take more care to not update last_used on path walk")
      Signed-off-by: Ian Kent <raven@themaw.net>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[4.11+]
      Cc: Colin Walters <walters@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ondrej Holy <oholy@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43694d4b
    • fs/fat/inode.c: fix sb_rdonly() change · b6e8e12c
      Committed by OGAWA Hirofumi
      Commit bc98a42c ("VFS: Convert sb->s_flags & MS_RDONLY to
      sb_rdonly(sb)") converted fat_remount():new_rdonly from a bool to an
      int.
      
      However fat_remount() depends upon the compiler's conversion of a
      non-zero integer into boolean `true'.
      
      Fix it by switching `new_rdonly' back into a bool.
      
      Link: http://lkml.kernel.org/r/87mv3d5x51.fsf@mail.parknet.co.jp
      Fixes: bc98a42c ("VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)")
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Joe Perches <joe@perches.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6e8e12c
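
      The general pitfall is easy to demonstrate in plain C (hypothetical flag
      value, not the actual FAT code or the real MS_RDONLY bit): masking a
      flag word into an int keeps the raw bit value, while a bool normalizes
      it to 0/1, so only the bool compares cleanly against another bool.

          #include <stdbool.h>
          #include <stdio.h>

          #define RDONLY_FLAG 0x04        /* illustrative flag bit */

          int main(void)
          {
                  unsigned long flags = RDONLY_FLAG;
                  bool currently_ro = true;

                  int  as_int  = flags & RDONLY_FLAG;   /* == 4 */
                  bool as_bool = flags & RDONLY_FLAG;   /* == true (1) */

                  printf("int compare says changed:  %d\n", as_int != currently_ro);
                  printf("bool compare says changed: %d\n", as_bool != currently_ro);
                  return 0;
          }
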
    • mm, memcg: fix mem_cgroup_swapout() for THPs · d08afa14
      Committed by Shakeel Butt
      Commit d6810d73 ("memcg, THP, swap: make mem_cgroup_swapout()
      support THP") changed mem_cgroup_swapout() to support transparent huge
      page (THP).
      
      However the patch missed one location which should be changed for
      correctly handling THPs.  The resulting bug will cause the memory
      cgroups whose THPs were swapped out to become zombies on deletion.
      
      Link: http://lkml.kernel.org/r/20171128161941.20931-1-shakeelb@google.com
      Fixes: d6810d73 ("memcg, THP, swap: make mem_cgroup_swapout() support THP")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d08afa14
    • mm: migrate: fix an incorrect call of prep_transhuge_page() · 40a899ed
      Committed by Zi Yan
      In https://lkml.org/lkml/2017/11/20/411, Andrea reported that during
      memory hotplug/hot remove prep_transhuge_page() is called incorrectly on
      non-THP pages for migration, when THP is on but THP migration is not
      enabled.  This leads to a bad state of target pages for migration.
      
      By inspecting the code, if called on a non-THP, prep_transhuge_page()
      will
      
       1) change the value of the mapping of (page + 2), since it is used for
          THP deferred list;
      
       2) change the lru value of (page + 1), since it is used for THP's dtor.
      
      Both can lead to data corruption of these two pages.
      
      Andrea said:
       "Pragmatically and from the point of view of the memory_hotplug subsys,
        the effect is a kernel crash when pages are being migrated during a
        memory hot remove offline and migration target pages are found in a
        bad state"
      
      This patch fixes it by only calling prep_transhuge_page() when we are
      certain that the target page is THP.
      
      Link: http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
      Fixes: 8135d892 ("mm: memory_hotplug: memory hotremove supports thp migration")
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Reported-by: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.14]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      40a899ed
    • kmemleak: add scheduling point to kmemleak_scan() · bde5f6bc
      Committed by Yisheng Xie
      kmemleak_scan() will scan struct page for each node, which can be really
      large, resulting in a soft lockup.  We have seen a soft lockup when
      running a scan while compiling the kernel:
      
        watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [bash:10287]
       [...]
        Call Trace:
         kmemleak_scan+0x21a/0x4c0
         kmemleak_write+0x312/0x350
         full_proxy_write+0x5a/0xa0
         __vfs_write+0x33/0x150
         vfs_write+0xad/0x1a0
         SyS_write+0x52/0xc0
         do_syscall_64+0x61/0x1a0
         entry_SYSCALL64_slow_path+0x25/0x25
      
      Fix this by adding a cond_resched() call every MAX_SCAN_SIZE (a sketch of
      the pattern follows this entry).
      
      Link: http://lkml.kernel.org/r/1511439788-20099-1-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bde5f6bc
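
      Roughly the pattern of the fix, hedged as a sketch (scan_block() and the
      MAX_SCAN_SIZE value stand in for the kmemleak internals): break the long
      linear scan into chunks and offer to reschedule between chunks.

          #define MAX_SCAN_SIZE 4096      /* bytes per chunk, as an illustration */

          static void scan_large_block_sketch(void *start, void *end)
          {
                  while (start < end) {
                          void *next = min(start + MAX_SCAN_SIZE, end);

                          scan_block(start, next);  /* stand-in for the real worker */
                          start = next;
                          cond_resched();           /* scheduling point per chunk */
                  }
          }
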
    • scripts/bloat-o-meter: don't fail with division by 0 · edbddb83
      Committed by Andy Shevchenko
      Under some circumstances it's possible to get a divisor of 0, which
      crashes the script.
      
        Traceback (most recent call last):
          File "linux/scripts/bloat-o-meter", line 98, in <module>
            print_result("Function", "tTdDbBrR", 2)
          File "linux/scripts/bloat-o-meter", line 87, in print_result
            (otot, ntot, (ntot - otot)*100.0/otot))
        ZeroDivisionError: float division by zero
      
      Avoid this by checking the divisor first.
      
      Link: http://lkml.kernel.org/r/20171123171219.31453-1-andriy.shevchenko@linux.intel.com
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Vaneet Narang <v.narang@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edbddb83
    • fs/mbcache.c: make count_objects() more robust · d5dabd63
      Committed by Jiang Biao
      When running the LTP stress test for 7*24 hours, vmscan occasionally
      starts emitting the following warning continuously:
      
        mb_cache_scan+0x0/0x3f0 negative objects to delete
        nr=-9232265467809300450
        ...
      
      Tracing shows that the freeable count (returned by mb_cache_count()) is
      -1, which causes continuous accumulation and overflow of total_scan.
      
      This patch makes sure that mb_cache_count() cannot return a negative
      value, which makes the mbcache shrinker more robust.
      
      Link: http://lkml.kernel.org/r/1511753419-52328-1-git-send-email-jiang.biao2@zte.com.cn
      Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <zhong.weidong@zte.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5dabd63
    • Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" · 90daf306
      Committed by Michal Hocko
      This reverts commit 0f6d24f8 ("mm/page-writeback.c: print a warning
      if the vm dirtiness settings are illogical") because it causes false
      positive warnings during OOM situations as noticed by Tetsuo Handa:
      
        Node 0 active_anon:3525940kB inactive_anon:8372kB active_file:216kB inactive_file:1872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2504kB dirty:52kB writeback:0kB shmem:8660kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 636928kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
        Node 0 DMA free:14848kB min:284kB low:352kB high:420kB active_anon:992kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 2687 3645 3645
        Node 0 DMA32 free:53004kB min:49608kB low:62008kB high:74408kB active_anon:2712648kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2773132kB mlocked:0kB kernel_stack:96kB pagetables:5096kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 0 958 958
        Node 0 Normal free:17140kB min:17684kB low:22104kB high:26524kB active_anon:812300kB inactive_anon:8372kB active_file:1228kB inactive_file:1868kB unevictable:0kB writepending:52kB present:1048576kB managed:981224kB mlocked:0kB kernel_stack:3520kB pagetables:8552kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        [...]
        Out of memory: Kill process 8459 (a.out) score 999 or sacrifice child
        Killed process 8459 (a.out) total-vm:4180kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 8459 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        vm direct limit must be set greater than background limit.
      
      The problem is that both thresh and bg_thresh will be 0 if
      available_memory is less than 4 pages when evaluating
      global_dirtyable_memory.
      
      While this might be worked around, the whole point of the warning is
      dubious at best.  We rely on admins to do sensible things when changing
      tunable knobs.  Dirty memory writeback knobs are not special in that
      regard, so revert the warning rather than adding more hacks to work
      around it.
      
      Debugged by Yafang Shao.
      
      Link: http://lkml.kernel.org/r/20171127091939.tahb77nznytcxw55@dhcp22.suse.cz
      Fixes: 0f6d24f8 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90daf306
    • mm/madvise.c: fix madvise() infinite loop under special circumstances · 6ea8d958
      Committed by chenjie
      MADVISE_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
      Unfortunately madvise_willneed() doesn't communicate this information
      properly to the generic madvise syscall implementation.  The calling
      convention is quite subtle there.  madvise_vma() is supposed to either
      return an error or update &prev; otherwise the main loop will never
      advance to the next vma and will keep looping forever without a way to
      get out of the kernel (a sketch of the convention follows this entry).
      
      It seems this has been broken since introduction.  Nobody has noticed
      because nobody seems to be using MADVISE_WILLNEED on these DAX mappings.
      
      [mhocko@suse.com: rewrite changelog]
      Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
      Fixes: fe77ba6f ("[PATCH] xip: madvice/fadvice: execute in place")
      Signed-off-by: chenjie <chenjie6@huawei.com>
      Signed-off-by: guoxuenan <guoxuenan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: zhangyi (F) <yi.zhang@huawei.com>
      Cc: Miao Xie <miaoxie@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ea8d958
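
      A hedged sketch of the calling convention described above (hypothetical
      function and predicate names, not the actual mm/madvise.c code): the
      handler must advance *prev even when the advice is a no-op, or the
      madvise() loop never terminates.

          static long willneed_sketch(struct vm_area_struct *vma,
                                      struct vm_area_struct **prev,
                                      unsigned long start, unsigned long end)
          {
                  *prev = vma;                    /* the missing update behind the hang */

                  if (mapping_is_dax(vma))        /* hypothetical predicate */
                          return 0;               /* WILLNEED is a no-op here */

                  /* ... normal readahead path ... */
                  return 0;
          }
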
    • exec: avoid RLIMIT_STACK races with prlimit() · 04e35f44
      Committed by Kees Cook
      While the defense-in-depth RLIMIT_STACK limit on setuid processes was
      protected against races from other threads calling setrlimit(), I missed
      protecting it against races from external processes calling prlimit().
      This adds locking around the change and makes sure that rlim_max is set
      too.
      
      Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast
      Fixes: 64701dee ("exec: Use sane stack rlimit under secureexec")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Reported-by: Brad Spengler <spender@grsecurity.net>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04e35f44
    • IB/core: disable memory registration of filesystem-dax vmas · 5f1d43de
      Committed by Dan Williams
      Until there is a solution to the dma-to-dax vs truncate problem it is
      not safe to allow RDMA to create long standing memory registrations
      against filesystem-dax vmas.
      
      Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Jason Gunthorpe <jgg@mellanox.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Inki Dae <inki.dae@samsung.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Joonyoung Shim <jy0922.shim@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f1d43de
    • v4l2: disable filesystem-dax mapping support · b70131de
      Committed by Dan Williams
      V4L2 memory registrations are incompatible with filesystem-dax that
      needs the ability to revoke dma access to a mapping at will, or
      otherwise allow the kernel to wait for completion of DMA.  The
      filesystem-dax implementation breaks the traditional solution of
      truncate of active file backed mappings since there is no page-cache
      page we can orphan to sustain ongoing DMA.
      
      If v4l2 wants to support long lived DMA mappings it needs to arrange to
      hold a file lease or use some other mechanism so that the kernel can
      coordinate revoking DMA access when the filesystem needs to truncate
      mappings.
      
      Link: http://lkml.kernel.org/r/151068940499.7446.12846708245365671207.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Inki Dae <inki.dae@samsung.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Joonyoung Shim <jy0922.shim@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b70131de
    • mm: fail get_vaddr_frames() for filesystem-dax mappings · b7f0554a
      Committed by Dan Williams
      Until there is a solution to the dma-to-dax vs truncate problem it is
      not safe to allow V4L2, Exynos, and other frame vector users to create
      long standing / irrevocable memory registrations against filesystem-dax
      vmas.
      
      [dan.j.williams@intel.com: add comment for vma_is_fsdax() check in get_vaddr_frames(), per Jan]
        Link: http://lkml.kernel.org/r/151197874035.26211.4061781453123083667.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: http://lkml.kernel.org/r/151068939985.7446.15684639617389154187.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Inki Dae <inki.dae@samsung.com>
      Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
      Cc: Joonyoung Shim <jy0922.shim@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7f0554a
    • mm: introduce get_user_pages_longterm · 2bb6d283
      Committed by Dan Williams
      Patch series "introduce get_user_pages_longterm()", v2.
      
      Here is a new get_user_pages api for cases where a driver intends to
      keep an elevated page count indefinitely.  This is distinct from usages
      like iov_iter_get_pages where the elevated page counts are transient.
      The iov_iter_get_pages cases immediately turn around and submit the
      pages to a device driver which will put_page when the i/o operation
      completes (under kernel control).
      
      In the longterm case userspace is responsible for dropping the page
      reference at some undefined point in the future.  This is untenable for
      filesystem-dax case where the filesystem is in control of the lifetime
      of the block / page and needs reasonable limits on how long it can wait
      for pages in a mapping to become idle.
      
      Fixing filesystems to actually wait for dax pages to be idle before
      blocks from a truncate/hole-punch operation are repurposed is saved for
      a later patch series.
      
      Also, allowing longterm registration of dax mappings is a future patch
      series that introduces a "map with lease" semantic where the kernel can
      revoke a lease and force userspace to drop its page references.
      
      I have also tagged these for -stable to purposely break cases that might
      assume that longterm memory registrations for filesystem-dax mappings
      were supported by the kernel.  The behavior regression this policy
      change implies is one of the reasons we maintain the "dax enabled.
      Warning: EXPERIMENTAL, use at your own risk" notification when mounting
      a filesystem in dax mode.
      
      It is worth noting the device-dax interface does not suffer the same
      constraints since it does not support file space management operations
      like hole-punch.
      
      This patch (of 4):
      
      Until there is a solution to the dma-to-dax vs truncate problem it is
      not safe to allow long standing memory registrations against
      filesystem-dax vmas.  Device-dax vmas do not have this problem and are
      explicitly allowed.
      
      This is temporary until a "memory registration with layout-lease"
      mechanism can be implemented for the affected sub-systems (RDMA and
      V4L2).
      
      [akpm@linux-foundation.org: use kcalloc()]
      Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Inki Dae <inki.dae@samsung.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Joonyoung Shim <jy0922.shim@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bb6d283
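
      A hedged usage sketch of the new helper: it takes the same arguments as
      get_user_pages() but declares the long-lived intent, and fails instead
      of pinning when the vma is filesystem-dax (via the vma_is_fsdax() check
      added by this patch). The wrapper name is illustrative.

          static long pin_user_buffer_sketch(unsigned long uaddr,
                                             unsigned long nr_pages,
                                             struct page **pages)
          {
                  /* FOLL_WRITE: the caller intends to write to the pages */
                  return get_user_pages_longterm(uaddr, nr_pages, FOLL_WRITE,
                                                 pages, NULL);
          }
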
    • device-dax: implement ->split() to catch invalid munmap attempts · 9702cffd
      Committed by Dan Williams
      Similar to how device-dax enforces that the 'address', 'offset', and
      'len' parameters to mmap() be aligned to the device's fundamental
      alignment, the same constraints apply to munmap().  Implement ->split()
      to fail munmap calls that violate the alignment constraint.
      
      Otherwise, we later fail VM_BUG_ON checks in the unmap_page_range() path
      with crash signatures of the form:
      
          vma ffff8800b60c8a88 start 00007f88c0000000 end 00007f88c0e00000
          next           (null) prev           (null) mm ffff8800b61150c0
          prot 8000000000000027 anon_vma           (null) vm_ops ffffffffa0091240
          pgoff 0 file ffff8800b638ef80 private_data           (null)
          flags: 0x380000fb(read|write|shared|mayread|maywrite|mayexec|mayshare|softdirty|mixedmap|hugepage)
          ------------[ cut here ]------------
          kernel BUG at mm/huge_memory.c:2014!
          [..]
          RIP: 0010:__split_huge_pud+0x12a/0x180
          [..]
          Call Trace:
           unmap_page_range+0x245/0xa40
           ? __vma_adjust+0x301/0x990
           unmap_vmas+0x4c/0xa0
           unmap_region+0xae/0x120
           ? __vma_rb_erase+0x11a/0x230
           do_munmap+0x276/0x410
           vm_munmap+0x6a/0xa0
           SyS_munmap+0x1d/0x30
      
      Link: http://lkml.kernel.org/r/151130418681.4029.7118245855057952010.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: dee41079 ("/dev/dax, core: file operations and dax-mmap")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9702cffd
    • mm, hugetlbfs: introduce ->split() to vm_operations_struct · 31383c68
      Committed by Dan Williams
      Patch series "device-dax: fix unaligned munmap handling"
      
      When device-dax is operating in huge-page mode we want it to behave like
      hugetlbfs and fail attempts to split vmas into unaligned ranges.  It
      would be messy to teach the munmap path about device-dax alignment
      constraints in the same (hstate) way that hugetlbfs communicates this
      constraint.  Instead, these patches introduce a new ->split() vm
      operation.
      
      This patch (of 2):
      
      The device-dax interface has similar constraints as hugetlbfs in that it
      requires the munmap path to unmap in huge page aligned units.  Rather
      than add more custom vma handling code in __split_vma() introduce a new
      vm operation to perform this vma specific check.
      
      Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: dee41079 ("/dev/dax, core: file operations and dax-mmap")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31383c68
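
      A hedged sketch of the new hook described in this series (the alignment
      value is illustrative; device-dax derives the real one from its region):
      ->split() lets the owner veto an unaligned split before __split_vma()
      proceeds.

          static int split_check_sketch(struct vm_area_struct *vma,
                                        unsigned long addr)
          {
                  if (!IS_ALIGNED(addr, PMD_SIZE))  /* illustrative alignment */
                          return -EINVAL;
                  return 0;
          }

          static const struct vm_operations_struct sketch_vm_ops = {
                  .split = split_check_sketch,
                  /* .fault, .huge_fault, ... */
          };
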
    • scripts/faddr2line: extend usage on generic arch · 95a87982
      Committed by Liu, Changcheng
      When cross-compiling, faddr2line should use the binary tools built for
      the target system, rather than those of the host.
      
      Link: http://lkml.kernel.org/r/20171121092911.GA150711@sofia
      Signed-off-by: Liu Changcheng <changcheng.liu@intel.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95a87982
    • mm: replace pte_write with pte_access_permitted in fault + gup paths · 5c9d2d5c
      Committed by Dan Williams
      The 'access_permitted' helper is used in the gup-fast path and goes
      beyond the simple _PAGE_RW check to also:
      
       - validate that the mapping is writable from a protection keys
         standpoint
      
       - validate that the pte has _PAGE_USER set, since all fault paths where
         pte_write() is used must be referencing user memory.
      
      Link: http://lkml.kernel.org/r/151043111604.2842.8051684481794973100.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c9d2d5c
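
      A minimal sketch of the difference being leveraged here (hedged; the
      wrapper name is illustrative): pte_write() tests only the write bit,
      while pte_access_permitted() also folds in the user-access and
      protection-key checks on architectures that implement it.

          static bool gup_can_write_sketch(pte_t pte)
          {
                  /* true only if writable, user-accessible and pkey-permitted */
                  return pte_access_permitted(pte, /* write = */ true);
          }
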
    • mm: replace pmd_write with pmd_access_permitted in fault + gup paths · c7da82b8
      Committed by Dan Williams
      The 'access_permitted' helper is used in the gup-fast path and goes
      beyond the simple _PAGE_RW check to also:
      
       - validate that the mapping is writable from a protection keys
         standpoint
      
       - validate that the pte has _PAGE_USER set, since all fault paths where
         pmd_write() is used must be referencing user memory.
      
      Link: http://lkml.kernel.org/r/151043111049.2842.15241454964150083466.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7da82b8
    • mm: replace pud_write with pud_access_permitted in fault + gup paths · e7fe7b5c
      Committed by Dan Williams
      The 'access_permitted' helper is used in the gup-fast path and goes
      beyond the simple _PAGE_RW check to also:
      
       - validate that the mapping is writable from a protection keys
         standpoint
      
       - validate that the pte has _PAGE_USER set, since all fault paths where
         pud_write() is used must be referencing user memory.
      
      [dan.j.williams@intel.com: fix powerpc compile error]
        Link: http://lkml.kernel.org/r/151129127237.37405.16073414520854722485.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: http://lkml.kernel.org/r/151043110453.2842.2166049702068628177.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7fe7b5c
    • mm: switch to 'define pmd_write' instead of __HAVE_ARCH_PMD_WRITE · e4e40e02
      Committed by Dan Williams
      In response to compile breakage introduced by a series that added the
      pud_write helper to x86, Stephen notes:
      
          did you consider using the other paradigm:
      
          In arch include files:
          #define pud_write       pud_write
          static inline int pud_write(pud_t pud)
           .....
      
          Then in include/asm-generic/pgtable.h:
      
          #ifndef pud_write
          static inline int pud_write(pud_t pud)
          {
                  ....
          }
          #endif
      
          If you had, then the powerpc code would have worked ... ;-) and many
          of the other interfaces in include/asm-generic/pgtable.h are
          protected that way ...
      
      Given that some architectures already define pmd_write() as a macro, it's
      a net reduction to drop the definition of __HAVE_ARCH_PMD_WRITE.
      
      Link: http://lkml.kernel.org/r/151129126721.37405.13339850900081557813.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4e40e02
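
      The paradigm Stephen describes, written out as a hedged sketch using
      pmd_write as in this commit (the generic fallback keeps the old BUG()
      behaviour for architectures that must never reach it):

          /* arch header: define the symbol so the generic fallback is skipped
           *
           *   #define pmd_write pmd_write
           *   static inline int pmd_write(pmd_t pmd) { ... }
           */

          /* include/asm-generic/pgtable.h: */
          #ifndef pmd_write
          static inline int pmd_write(pmd_t pmd)
          {
                  BUG();          /* callers must not reach this on such archs */
                  return 0;
          }
          #endif /* pmd_write */
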
    • mm: fix device-dax pud write-faults triggered by get_user_pages() · 1501899a
      Committed by Dan Williams
      Currently only get_user_pages_fast() can safely handle the writable gup
      case due to its use of pud_access_permitted() to check whether the pud
      entry is writable.  In the gup slow path pud_write() is used instead of
      pud_access_permitted(), and to date the latter has been unimplemented; it
      just calls BUG_ON().
      
          kernel BUG at ./include/linux/hugetlb.h:244!
          [..]
          RIP: 0010:follow_devmap_pud+0x482/0x490
          [..]
          Call Trace:
           follow_page_mask+0x28c/0x6e0
           __get_user_pages+0xe4/0x6c0
           get_user_pages_unlocked+0x130/0x1b0
           get_user_pages_fast+0x89/0xb0
           iov_iter_get_pages_alloc+0x114/0x4a0
           nfs_direct_read_schedule_iovec+0xd2/0x350
           ? nfs_start_io_direct+0x63/0x70
           nfs_file_direct_read+0x1e0/0x250
           nfs_file_read+0x90/0xc0
      
      For now this just implements a simple check for the _PAGE_RW bit similar
      to pmd_write.  However, this implies that the gup-slow-path check is
      missing the extra checks that the gup-fast-path performs with
      pud_access_permitted.  Later patches will align all checks to use the
      'access_permitted' helper if the architecture provides it.
      
      Note that the generic 'access_permitted' helper fallback is the simple
      _PAGE_RW check on architectures that do not define the
      'access_permitted' helper(s).
      
      [dan.j.williams@intel.com: fix powerpc compile error]
        Link: http://lkml.kernel.org/r/151129126165.37405.16031785266675461397.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: http://lkml.kernel.org/r/151043109938.2842.14834662818213616199.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: a00cc7d9 ("mm, x86: add support for PUD-sized transparent hugepages")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>	[x86]
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1501899a
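
      The "simple check for the _PAGE_RW bit" mentioned above, sketched in its
      x86 flavor (hedged; other architectures differ and the generic fallback
      described later replaces such open-coded checks):

          #define pud_write pud_write
          static inline int pud_write(pud_t pud)
          {
                  return pud_flags(pud) & _PAGE_RW;
          }
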
    • mm/cma: fix alloc_contig_range ret code/potential leak · 63cd4489
      Committed by Mike Kravetz
      If the call to __alloc_contig_migrate_range() in alloc_contig_range() returns
      -EBUSY, processing continues so that test_pages_isolated() is called
      where there is a tracepoint to identify the busy pages.  However, it is
      possible for busy pages to become available between the calls to these
      two routines.  In this case, the range of pages may be allocated.
      Unfortunately, the original return code (ret == -EBUSY) is still set and
      returned to the caller.  Therefore, the caller believes the pages were
      not allocated and they are leaked.
      
      Update the comment to indicate that allocation is still possible even if
      __alloc_contig_migrate_range returns -EBUSY.  Also, clear return code in
      this case so that it is not accidentally used or returned to caller.
      
      Link: http://lkml.kernel.org/r/20171122185214.25285-1-mike.kravetz@oracle.com
      Fixes: 8ef5849f ("mm/cma: always check which page caused allocation failure")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63cd4489
    • mm, oom_reaper: gather each vma to prevent leaking TLB entry · 687cb088
      Committed by Wang Nan
      tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
      space.  In this case, tlb->fullmm is true.  Some archs like arm64 don't
      flush the TLB when tlb->fullmm is true:
      
        commit 5a7862e8 ("arm64: tlbflush: avoid flushing when fullmm == 1").
      
      This causes leaking of TLB entries.
      
      Will clarifies his patch:
       "Basically, we tag each address space with an ASID (PCID on x86) which
        is resident in the TLB. This means we can elide TLB invalidation when
        pulling down a full mm because we won't ever assign that ASID to
        another mm without doing TLB invalidation elsewhere (which actually
        just nukes the whole TLB).
      
        I think that means that we could potentially not fault on a kernel
        uaccess, because we could hit in the TLB"
      
      There could be a window between complete_signal() sending an IPI to other
      cores and all threads sharing this mm actually being kicked off those
      cores.  In this window, the oom reaper may call tlb_flush_mmu_tlbonly()
      to flush the TLB and then free pages.  However, due to the above problem,
      the TLB entries are not really flushed on arm64, so other threads can
      still access these pages through stale TLB entries.  Moreover, a
      copy_to_user() can also write to these pages without generating a page
      fault, causing use-after-free bugs.
      
      This patch gathers each vma instead of gathering the full vm space, so
      tlb->fullmm is not true.  The behavior of the oom reaper becomes similar
      to munmapping before do_exit, which should be safe for all archs (a
      simplified sketch follows this entry).
      
      Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      687cb088
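
      A simplified sketch of the per-vma gathering described above (the real
      reaper's vma filtering and locking are omitted): each tlb_gather_mmu()
      now covers a single vma, so tlb->fullmm is false and the flush in
      tlb_finish_mmu() really happens.

          static void reap_mm_sketch(struct mm_struct *mm)
          {
                  struct vm_area_struct *vma;

                  for (vma = mm->mmap; vma; vma = vma->vm_next) {
                          struct mmu_gather tlb;

                          tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
                          unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
                                           NULL);
                          tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
                  }
          }
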
    • mm, memory_hotplug: do not back off draining pcp free pages from kworker context · 4b81cb2f
      Committed by Michal Hocko
      drain_all_pages backs off when called from a kworker context since
      commit 0ccce3b9 ("mm, page_alloc: drain per-cpu pages from workqueue
      context") because the original IPI based pcp draining has been replaced
      by a WQ based one, and the check wanted to prevent recursion and
      inter-worker dependencies.  This made some sense at the time
      because the system WQ has been used and one worker holding the lock
      could be blocked while waiting for new workers to emerge which can be a
      problem under OOM conditions.
      
      Since then commit ce612879 ("mm: move pcp and lru-pcp draining into
      single wq") has moved draining to a dedicated (mm_percpu_wq) WQ with a
      rescuer, so we shouldn't depend on any other WQ activity to make forward
      progress.  Calling drain_all_pages from a worker context is therefore
      safe as long as this doesn't happen from mm_percpu_wq itself, which is
      not the case because all of its workers are required to _not_ depend on
      any MM locks.
      
      Why is this a problem in the first place? ACPI driven memory hot-remove
      (acpi_device_hotplug) is executed from the worker context.  We end up
      calling __offline_pages to free all the pages and that requires both
      lru_add_drain_all_cpuslocked and drain_all_pages to do their job
      otherwise we can have dangling pages on pcp lists and fail the offline
      operation (__test_page_isolated_in_pageblock would see a page with 0 ref
      count but without PageBuddy set).
      
      Fix the issue by removing the worker check in drain_all_pages.
      lru_add_drain_all_cpuslocked doesn't have this restriction so it works
      as expected.
      
      Link: http://lkml.kernel.org/r/20170828093341.26341-1-mhocko@kernel.org
      Fixes: 0ccce3b9 ("mm, page_alloc: drain per-cpu pages from workqueue context")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.11+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b81cb2f
    • Merge tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linux · b9151761
      Committed by Linus Torvalds
      Pull nfsd fixes from Bruce Fields:
       "I screwed up my merge window pull request; I only sent half of what I
        meant to.
      
        There were no new features, just bugfixes of various importance and
        some very minor cleanup, so I think it's all still appropriate for
        -rc2.
      
        Highlights:
      
         - Fixes from Trond for some races in the NFSv4 state code.
      
          - Fix from Naofumi Honda for a typo in the blocked lock notification
           code
      
         - Fixes from Vasily Averin for some problems starting and stopping
           lockd especially in network namespaces"
      
      * tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linux: (23 commits)
        lockd: fix "list_add double add" caused by legacy signal interface
        nlm_shutdown_hosts_net() cleanup
        race of nfsd inetaddr notifiers vs nn->nfsd_serv change
        race of lockd inetaddr notifiers vs nlmsvc_rqst change
        SUNRPC: make cache_detail structures const
        NFSD: make cache_detail structures const
        sunrpc: make the function arg as const
        nfsd: check for use of the closed special stateid
        nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat
        lockd: lost rollback of set_grace_period() in lockd_down_net()
        lockd: added cleanup checks in exit_net hook
        grace: replace BUG_ON by WARN_ONCE in exit_net hook
        nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex class
        lockd: remove net pointer from messages
        nfsd: remove net pointer from debug messages
        nfsd: Fix races with check_stateid_generation()
        nfsd: Ensure we check stateid validity in the seqid operation checks
        nfsd: Fix race in lock stateid creation
        nfsd4: move find_lock_stateid
        nfsd: Ensure we don't recognise lock stateids after freeing them
        ...
      b9151761
    • Merge tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 26cd9474
      Committed by Linus Torvalds
      Pull btrfs fixes from David Sterba:
       "We've collected some fixes since the pre-merge window freeze.
      
        There's technically only one regression fix for 4.15, but the rest
        seems important and candidates for stable.
      
         - fix missing flush bio puts in error cases (is serious, but rarely
           happens)
      
         - fix reporting stat::st_blocks for buffered append writes
      
         - fix space cache invalidation
      
         - fix out of bound memory access when setting zlib level
      
         - fix potential memory corruption when fsync fails in the middle
      
         - fix crash in integrity checker
      
         - incremental send fix, path mixup for certain unlink/rename
           combination
      
         - pass flags to writeback so compressed writes can be throttled
           properly
      
         - error handling fixes"
      
      * tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        Btrfs: incremental send, fix wrong unlink path after renaming file
        btrfs: tree-checker: Fix false panic for sanity test
        Btrfs: fix list_add corruption and soft lockups in fsync
        btrfs: Fix wild memory access in compression level parser
        btrfs: fix deadlock when writing out space cache
        btrfs: clear space cache inode generation always
        Btrfs: fix reported number of inode blocks after buffered append writes
        Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
        Btrfs: bail out gracefully rather than BUG_ON
        btrfs: dev_alloc_list is not protected by RCU, use normal list_del
        btrfs: add missing device::flush_bio puts
        btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
        Btrfs: add write_flags for compression bio
      26cd9474
    • Merge tag 'microblaze-4.15-rc2' of git://git.monstr.eu/linux-2.6-microblaze · 198e0c0c
      Committed by Linus Torvalds
      Pull Microblaze fix from Michal Simek:
       "Add missing header to mmu_context_mm.h"
      
      * tag 'microblaze-4.15-rc2' of git://git.monstr.eu/linux-2.6-microblaze:
        microblaze: add missing include to mmu_context_mm.h
      198e0c0c
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · fccfde44
      Committed by Linus Torvalds
      Pull sparc fix from David Miller:
       "Sparc T4 and later cpu bootup regression fix"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc64: Fix boot on T4 and later.
      fccfde44
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 96c22a49
      Committed by Linus Torvalds
      Pull networking fixes from David Miller:
      
       1) The forcedeth conversion from pci_*() DMA interfaces to dma_*() ones
          missed one spot. From Zhu Yanjun.
      
       2) Missing CRYPTO_SHA256 Kconfig dep in cfg80211, from Johannes Berg.
      
       3) Fix checksum offloading in thunderx driver, from Sunil Goutham.
      
       4) Add SPDX to vm_sockets_diag.h, from Stephen Hemminger.
      
       5) Fix use after free of packet headers in TIPC, from Jon Maloy.
      
       6) "sizeof(ptr)" vs "sizeof(*ptr)" bug in i40e, from Gustavo A R Silva.
      
       7) Tunneling fixes in mlxsw driver, from Petr Machata.
      
       8) Fix crash in fanout_demux_rollover() of AF_PACKET, from Mike
          Maloney.
      
       9) Fix race in AF_PACKET bind() vs. NETDEV_UP notifier, from Eric
          Dumazet.
      
      10) Fix regression in sch_sfq.c due to one of the timer_setup()
          conversions. From Paolo Abeni.
      
      11) SCTP does list_for_each_entry() using wrong struct member, fix from
          Xin Long.
      
      12) Don't use big endian netlink attribute read for
          IFLA_BOND_AD_ACTOR_SYSTEM, it is in cpu endianness. Also from Xin
          Long.
      
      13) Fix mis-initialization of q->link.clock in CBQ scheduler, preventing
          adding filters there. From Jiri Pirko.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
        ethernet: dwmac-stm32: Fix copyright
        net: via: via-rhine: use %p to format void * address instead of %x
        net: ethernet: xilinx: Mark XILINX_LL_TEMAC broken on 64-bit
        myri10ge: Update MAINTAINERS
        net: sched: cbq: create block for q->link.block
        atm: suni: remove extraneous space to fix indentation
        atm: lanai: use %p to format kernel addresses instead of %x
        VSOCK: Don't set sk_state to TCP_CLOSE before testing it
        atm: fore200e: use %pK to format kernel addresses instead of %x
        ambassador: fix incorrect indentation of assignment statement
        vxlan: use __be32 type for the param vni in __vxlan_fdb_delete
        bonding: use nla_get_u64 to extract the value for IFLA_BOND_AD_ACTOR_SYSTEM
        sctp: use right member as the param of list_for_each_entry
        sch_sfq: fix null pointer dereference at timer expiration
        cls_bpf: don't decrement net's refcount when offload fails
        net/packet: fix a race in packet_bind() and packet_notifier()
        packet: fix crash in fanout_demux_rollover()
        sctp: remove extern from stream sched
        sctp: force the params with right types for sctp csum apis
        sctp: force SCTP_ERROR_INV_STRM with __u32 when calling sctp_chunk_fail
        ...
      96c22a49
    • sparc64: Fix boot on T4 and later. · e5372cd5
      Committed by David S. Miller
      If we don't put the NG4fls.o object into the same part of
      the link as the generic sparc64 objects for fls() and __fls()
      then the relocation in the branch we use for patching will
      not fit.
      
      Move NG4fls.o into lib-y to fix this problem.
      
      Fixes: 46ad8d2d ("sparc64: Use sparc optimized fls and __fls for T4 and above")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Reported-by: Anatoly Pugachev <matorola@gmail.com>
      Tested-by: Anatoly Pugachev <matorola@gmail.com>
      e5372cd5
    • vsprintf: don't use 'restricted_pointer()' when not restricting · ef0010a3
      Committed by Linus Torvalds
      Instead, just fall back on the new '%p' behavior which hashes the
      pointer.
      
      Otherwise, '%pK' - that was intended to mark a pointer as restricted -
      just ends up leaking pointers that a normal '%p' wouldn't leak.  Which
      just makes the whole thing pointless.
      
      I suspect we should actually get rid of '%pK' entirely, and make it just
      work as '%p' regardless, but this is the minimal obvious fix.  People
      who actually use 'kptr_restrict' should weigh in on which behavior they
      want.
      
      Cc: Tobin Harding <me@tobin.cc>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef0010a3
    • kallsyms: take advantage of the new '%px' format · 668533dc
      Committed by Linus Torvalds
      The conditional kallsym hex printing used a special fixed-width '%lx'
      output (KALLSYM_FMT) in preparation for the hashing of %p, but that
      series ended up adding a %px specifier to help with the conversions.
      
      Use it, and avoid the "print pointer as an unsigned long" code.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      668533dc
    • Merge tag 'printk-hash-pointer-4.15-rc2' of git://github.com/tcharding/linux · da6af54d
      Committed by Linus Torvalds
      Pull printk pointer hashing update from Tobin Harding:
       "Here is the patch set that implements hashing of printk specifier %p.
      
        First we have two clean up patches then we do the hashing. Hashing is
        done via the SipHash algorithm. The next patch adds printk specifier
       %px for printing pointers when we _really_ want to see the address, i.e.
        %px is functionally equivalent to %lx. Final patch in the set fixes
        KASAN since we break it by hashing %p.
      
        For the record here is the justification for the series:
      
          Currently there exist approximately 14 000 places in the Kernel
          where addresses are being printed using an unadorned %p. This
          potentially leaks sensitive information about the Kernel layout in
          memory. Many of these calls are stale, instead of fixing every call
          we hash the address by default before printing. We then add %px to
          provide a way to print the actual address. Although this is
          achievable using %lx, using %px will assist us if we ever want to
          change pointer printing behaviour. %px is more uniquely grep'able
          (there are already >50 000 uses of %lx).
      
          The added advantage of hashing %p is that security is now opt-out,
          if you _really_ want the address you have to work a little harder
          and use %px.
      
        This will of course break some users, forcing code printing needed
        addresses to be updated"
      
      [ I do expect this to be an annoyance, and a number of %px users to be
        added for debuggability. But nobody is willing to audit existing %p
        users for information leaks, and a number of places really only use
        the pointer as an object identifier rather than really 'I need the
        address'.
      
        IOW - sorry for the inconvenience, but it's the least inconvenient of
        the options.    - Linus ]
      
      * tag 'printk-hash-pointer-4.15-rc2' of git://github.com/tcharding/linux:
        kasan: use %px to print addresses instead of %p
        vsprintf: add printk specifier %px
        printk: hash addresses printed with %p
        vsprintf: refactor %pK code out of pointer()
        docs: correct documentation for %pK
      da6af54d
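
      A small usage sketch of the resulting printk behavior (hedged; function
      name illustrative): %p now prints a hashed identifier by default, and
      %px is the explicit opt-out for the raw address.

          static void dump_object_sketch(const void *obj)
          {
                  pr_info("object %p (hashed, safe for logs)\n", obj);
                  pr_info("object %px (raw address, debugging only)\n", obj);
          }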