1. 03 Apr, 2020 40 commits
    • U
      KVM: SVM: Split svm_vcpu_run inline assembly to separate file · 199cd1d7
      Uros Bizjak committed
      GCC copes poorly with an inline assembly block that clobbers all
      available machine registers in the middle of a function. This is the
      case for svm_vcpu_run in kvm/svm.c, where it results in many register
      spills and fills to/from the stack frame.
      
      This patch fixes the issue with the same approach as was done for
      VMX some time ago. The big inline assembly is moved to a separate
      assembly .S file, taking into account all ABI requirements.
      
      There are two main benefits of the above approach:
      
      * elimination of several register spills and fills to/from stack
      frame, and consequently smaller function .text size. The binary size
      of svm_vcpu_run is lowered from 2019 to 1626 bytes.
      
      * more efficient access to the register save array. Currently, the
      register save array is accessed as:
      
          7b00:    48 8b 98 28 02 00 00     mov    0x228(%rax),%rbx
          7b07:    48 8b 88 18 02 00 00     mov    0x218(%rax),%rcx
          7b0e:    48 8b 90 20 02 00 00     mov    0x220(%rax),%rdx
      
      and by passing a pointer to the register array as an argument to a function, one gets:
      
        12:    48 8b 48 08              mov    0x8(%rax),%rcx
        16:    48 8b 50 10              mov    0x10(%rax),%rdx
        1a:    48 8b 58 18              mov    0x18(%rax),%rbx
      
      As a result, the total size, considering that the new function size is 229
      bytes, gets lowered by 164 bytes.
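The 164-byte figure follows from the sizes quoted in this commit message. A quick check of the arithmetic, using only those numbers (the variable names are illustrative):

```python
# Sizes in bytes, as quoted in the commit message.
old_run_size = 2019   # svm_vcpu_run with the inline assembly block
new_run_size = 1626   # svm_vcpu_run after the assembly was moved out
new_asm_size = 229    # the new function in the separate .S file

saved_in_caller = old_run_size - new_run_size   # bytes removed from svm_vcpu_run
net_saving = saved_in_caller - new_asm_size     # total reduction after adding the new function

print(saved_in_caller, net_saving)
```

So svm_vcpu_run itself shrinks by 393 bytes, and 164 bytes remain as the net saving once the new function is counted.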
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      199cd1d7
    • J
      KVM: SVM: Move SEV code to separate file · eaf78265
      Joerg Roedel committed
      Move the SEV-specific parts of svm.c into the new sev.c file.
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Message-Id: <20200324094154.32352-5-joro@8bytes.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eaf78265
    • J
      KVM: SVM: Move AVIC code to separate file · ef0f6496
      Joerg Roedel committed
      Move the AVIC-related functions from svm.c to the new avic.c file.
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Message-Id: <20200324094154.32352-4-joro@8bytes.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ef0f6496
    • J
      KVM: SVM: Move Nested SVM Implementation to nested.c · 883b0a91
      Joerg Roedel committed
      Split out the code for the nested SVM implementation and move it to a
      separate file.
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Message-Id: <20200324094154.32352-3-joro@8bytes.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      883b0a91
    • J
      kVM SVM: Move SVM related files to own sub-directory · 46a010dd
      Joerg Roedel committed
      Move svm.c and pmu_amd.c into their own arch/x86/kvm/svm/
      subdirectory.
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Message-Id: <20200324094154.32352-2-joro@8bytes.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      46a010dd
    • L
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 8c1b724d
      Linus Torvalds committed
      Pull kvm updates from Paolo Bonzini:
       "ARM:
         - GICv4.1 support
      
         - 32bit host removal
      
        PPC:
         - secure (encrypted) guests under the Protected Execution Framework
           ultravisor
      
        s390:
         - allow disabling GISA (hardware interrupt injection) and protected
           VMs/ultravisor support.
      
        x86:
         - New dirty bitmap flag that sets all bits in the bitmap when dirty
           page logging is enabled; this is faster because it doesn't require
           bulk modification of the page tables.
      
         - Initial work on making nested SVM event injection more similar to
           VMX, and less buggy.
      
         - Various cleanups to MMU code (though the big ones and related
           optimizations were delayed to 5.8). Instead of using cr3 in
           function names which occasionally means eptp, KVM too has
           standardized on "pgd".
      
         - A large refactoring of CPUID features, which now use an array that
           parallels the core x86_features.
      
         - Some removal of pointer chasing from kvm_x86_ops, which will also
           be switched to static calls as soon as they are available.
      
         - New Tigerlake CPUID features.
      
         - More bugfixes, optimizations and cleanups.
      
        Generic:
         - selftests: cleanups, new MMU notifier stress test, steal-time test
      
         - CSV output for kvm_stat"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (277 commits)
        x86/kvm: fix a missing-prototypes "vmread_error"
        KVM: x86: Fix BUILD_BUG() in __cpuid_entry_get_reg() w/ CONFIG_UBSAN=y
        KVM: VMX: Add a trampoline to fix VMREAD error handling
        KVM: SVM: Annotate svm_x86_ops as __initdata
        KVM: VMX: Annotate vmx_x86_ops as __initdata
        KVM: x86: Drop __exit from kvm_x86_ops' hardware_unsetup()
        KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection
        KVM: x86: Set kvm_x86_ops only after ->hardware_setup() completes
        KVM: VMX: Configure runtime hooks using vmx_x86_ops
        KVM: VMX: Move hardware_setup() definition below vmx_x86_ops
        KVM: x86: Move init-only kvm_x86_ops to separate struct
        KVM: Pass kvm_init()'s opaque param to additional arch funcs
        s390/gmap: return proper error code on ksm unsharing
        KVM: selftests: Fix cosmetic copy-paste error in vm_mem_region_move()
        KVM: Fix out of range accesses to memslots
        KVM: X86: Micro-optimize IPI fastpath delay
        KVM: X86: Delay read msr data iff writes ICR MSR
        KVM: PPC: Book3S HV: Add a capability for enabling secure guests
        KVM: arm64: GICv4.1: Expose HW-based SGIs in debugfs
        KVM: arm64: GICv4.1: Allow non-trapping WFI when using HW SGIs
        ...
      8c1b724d
    • L
      Merge tag 'x86-urgent-2020-04-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f14a9532
      Linus Torvalds committed
      Pull x86 fix from Ingo Molnar:
       "A single fix addressing Sparse warnings. <asm/bitops.h> is changed
        non-trivially to avoid the warnings, but generated code is not
        supposed to be affected"
      
      * tag 'x86-urgent-2020-04-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86: Fix bitops.h warning with a moved cast
      f14a9532
    • L
      Merge branch 'next-integrity' of... · 7f218319
      Linus Torvalds committed
      Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
      
      Pull integrity updates from Mimi Zohar:
       "Just a couple of updates for linux-5.7:
      
         - A new Kconfig option to enable IMA architecture specific runtime
           policy rules needed for secure and/or trusted boot, as requested.
      
         - Some message cleanup (eg. pr_fmt, additional error messages)"
      
      * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
        ima: add a new CONFIG for loading arch-specific policies
        integrity: Remove duplicate pr_fmt definitions
        IMA: Add log statements for failure conditions
        IMA: Update KBUILD_MODNAME for IMA files to ima
      7f218319
    • L
      Merge branch 'akpm' (patches from Andrew) · 6cad420c
      Linus Torvalds committed
      Merge updates from Andrew Morton:
       "A large amount of MM, plenty more to come.
      
        Subsystems affected by this patch series:
         - tools
         - kthread
         - kbuild
         - scripts
         - ocfs2
         - vfs
         - mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
               sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
               hugetlbfs, hugetlb"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
        include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
        mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
        selftests/vm: fix map_hugetlb length used for testing read and write
        mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
        mm/hugetlb.c: clean code by removing unnecessary initialization
        hugetlb_cgroup: add hugetlb_cgroup reservation docs
        hugetlb_cgroup: add hugetlb_cgroup reservation tests
        hugetlb: support file_region coalescing again
        hugetlb_cgroup: support noreserve mappings
        hugetlb_cgroup: add accounting for shared mappings
        hugetlb: disable region_add file_region coalescing
        hugetlb_cgroup: add reservation accounting for private mappings
        mm/hugetlb_cgroup: fix hugetlb_cgroup migration
        hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
        hugetlb_cgroup: add hugetlb_cgroup reservation counter
        hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
        hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
        mm/memblock.c: remove redundant assignment to variable max_addr
        mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
        mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
        ...
      6cad420c
    • L
      Merge tag 'xfs-5.7-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 7be97138
      Linus Torvalds committed
      Pull xfs updates from Darrick Wong:
       "There's a lot going on this cycle with cleanups in the log code, the
        btree code, and the xattr code.
      
        We're tightening metadata validation and online fsck checking, and
        introducing a common btree rebuilding library so that we can refactor
        xfs_repair and introduce online repair in a future cycle.
      
        We also fixed a few visible bugs -- most notably there's one in
        getdents that we introduced in 5.6; and a fix for hangs when disabling
        quotas.
      
        This series has been running fstests & other QA in the background for
        over a week and looks good so far.
      
        I anticipate sending a second pull request next week. That batch will
        change how xfs interacts with memory reclaim; how the log batches and
        throttles log items; how hard writes near ENOSPC will try to squeeze
        more space out of the filesystem; and hopefully fix the last of the
        umount hangs after a catastrophic failure. That should ease a lot of
        problems when running at the limits, but for now I'm leaving that in
        for-next for another week to make sure we got all the subtleties
        right.
      
        Summary:
      
         - Fix a hard to trigger race between iclog error checking and log
           shutdown.
      
         - Strengthen the AGF verifier.
      
         - Ratelimit some of the more spammy error messages.
      
         - Remove the icdinode uid/gid members and just use the ones in the
           vfs inode.
      
         - Hold ILOCK across insert/collapse range.
      
         - Clean up the extended attribute interfaces.
      
         - Clean up the attr flags mess.
      
         - Restore PF_MEMALLOC after exiting xfsaild thread to avoid
           triggering warnings in the process accounting code.
      
         - Remove the flexibly-sized array from struct xfs_agfl to eliminate
           compiler warnings about unaligned pointers and packed structures.
      
         - Various macro and typedef removals.
      
         - Stale metadata buffers if we decide they're corrupt outside of a
           verifier.
      
         - Check directory data/block/free block owners.
      
         - Fix a UAF when aborting inactivation of a corrupt xattr fork.
      
         - Teach online scrub to report failed directory and attr name lookups
           as a metadata corruption instead of a runtime error.
      
         - Avoid potential buffer overflows in sysfs files by using scnprintf.
      
         - Fix a regression in getdents lookups due to a mistake in pointer
           arithmetic.
      
         - Refactor btree cursor private data structures to use anonymous
           unions.
      
         - Cleanups in the log unmounting code.
      
         - Fix a potential mishandling of ENOMEM errors on multi-block
           directory buffer lookups.
      
         - Fix an incorrect test in the block allocation code.
      
         - Cleanups and name prefix shortening in the scrub code.
      
         - Introduce btree bulk loading code for online repair and scrub.
      
         - Fix a quotaoff log item leak (and hang) when the fs goes down
           midway through a quotaoff operation.
      
         - Remove di_version from the incore inode.
      
         - Refactor some of the log shutdown checking code.
      
         - Record the forcing of the log unmount records in the log force
           counters.
      
         - Fix a longstanding bug where quotacheck would purge the
           administrator's default quota grace interval and warning limits.
      
         - Reduce memory usage when scrubbing directory and xattr trees.
      
         - Don't let fsfreeze race with GETFSMAP or online scrub.
      
         - Handle bio_add_page failures more gracefully in xlog_write_iclog"
      
      * tag 'xfs-5.7-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (108 commits)
        xfs: prohibit fs freezing when using empty transactions
        xfs: shutdown on failure to add page to log bio
        xfs: directory bestfree check should release buffers
        xfs: drop all altpath buffers at the end of the sibling check
        xfs: preserve default grace interval during quotacheck
        xfs: remove xlog_state_want_sync
        xfs: move the ioerror check out of xlog_state_clean_iclog
        xfs: refactor xlog_state_clean_iclog
        xfs: remove the aborted parameter to xlog_state_done_syncing
        xfs: simplify log shutdown checking in xfs_log_release_iclog
        xfs: simplify the xfs_log_release_iclog calling convention
        xfs: factor out a xlog_wait_on_iclog helper
        xfs: merge xlog_cil_push into xlog_cil_push_work
        xfs: remove the di_version field from struct icdinode
        xfs: simplify a check in xfs_ioctl_setattr_check_cowextsize
        xfs: simplify di_flags2 inheritance in xfs_ialloc
        xfs: only check the superblock version for dinode size calculation
        xfs: add a new xfs_sb_version_has_v3inode helper
        xfs: fix unmount hang and memory leak on shutdown during quotaoff
        xfs: factor out quotaoff intent AIL removal and memory free
        ...
      7be97138
    • L
      Merge tag 'vfs-5.7-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 7db83c07
      Linus Torvalds committed
      Pull hibernation fix from Darrick Wong:
       "Fix a regression where we broke the userspace hibernation driver by
        disallowing writes to the swap device"
      
      * tag 'vfs-5.7-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        hibernate: Allow uswsusp to write to swap
      7db83c07
    • L
      Merge tag 'iomap-5.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 35a9fafe
      Linus Torvalds committed
      Pull iomap updates from Darrick Wong:
       "We're fixing tracepoints and comments in this cycle, so there
        shouldn't be any surprises here.
      
        I anticipate sending a second pull request next week with a single bug
        fix for readahead, but it's still undergoing QA.
      
        Summary:
      
         - Fix a broken tracepoint
      
         - Fix a broken comment"
      
      * tag 'iomap-5.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        iomap: fix comments in iomap_dio_rw
        iomap: Remove pgoff from tracepoints
      35a9fafe
    • L
      Merge branch 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 9c577491
      Linus Torvalds committed
      Pull vfs pathwalk sanitizing from Al Viro:
       "Massive pathwalk rewrite and cleanups.
      
        Several iterations have been posted; hopefully this thing is getting
        readable and understandable now. Pretty much all parts of pathname
        resolutions are affected...
      
        The branch is identical to what has sat in -next, except for commit
        message in "lift all calls of step_into() out of follow_dotdot/
        follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
        commit message changed there."
      
      * 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
        lookup_open(): don't bother with fallbacks to lookup+create
        atomic_open(): no need to pass struct open_flags anymore
        open_last_lookups(): move complete_walk() into do_open()
        open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
        open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
        open_last_lookups(): consolidate fsnotify_create() calls
        take post-lookup part of do_last() out of loop
        link_path_walk(): sample parent's i_uid and i_mode for the last component
        __nd_alloc_stack(): make it return bool
        reserve_stack(): switch to __nd_alloc_stack()
        pick_link(): take reserving space on stack into a new helper
        pick_link(): more straightforward handling of allocation failures
        fold path_to_nameidata() into its only remaining caller
        pick_link(): pass it struct path already with normal refcounting rules
        fs/namei.c: kill follow_mount()
        non-RCU analogue of the previous commit
        helper for mount rootwards traversal
        follow_dotdot(): be lazy about changing nd->path
        follow_dotdot_rcu(): be lazy about changing nd->path
        follow_dotdot{,_rcu}(): massage loops
        ...
      9c577491
    • Q
      x86/kvm: fix a missing-prototypes "vmread_error" · 514ccc19
      Qian Cai committed
      Commit 842f4be9 ("KVM: VMX: Add a trampoline to fix VMREAD error
      handling") removed the declaration of vmread_error(), which causes a W=1
      build failure with KVM_WERROR=y. Fix it by adding the declaration back.
      
      arch/x86/kvm/vmx/vmx.c:359:17: error: no previous prototype for 'vmread_error' [-Werror=missing-prototypes]
       asmlinkage void vmread_error(unsigned long field, bool fault)
                       ^~~~~~~~~~~~
      Signed-off-by: Qian Cai <cai@lca.pw>
      Message-Id: <20200402153955.1695-1-cai@lca.pw>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      514ccc19
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace · d987ca1c
      Linus Torvalds committed
      Pull exec/proc updates from Eric Biederman:
       "This contains two significant pieces of work: the work to sort out
        proc_flush_task, and the work to solve a deadlock between strace and
        exec.
      
        Fixing proc_flush_task so that it no longer requires a persistent
        mount makes improvements to proc possible. The removal of the
        persistent mount solves an old regression that caused the hidepid
        mount option to only work on remount not on mount. The regression was
        found and reported by the Android folks. This further allows Alexey
        Gladkov's work making proc mount options specific to an individual
        mount of proc to move forward.
      
        The work on exec starts solving a long-standing issue: exec holds
        mutexes while blocking on userspace applications, which makes it
        extremely deadlock prone. For the moment this adds a second mutex with
        a narrower scope that handles all of the easy cases, which makes the
        tricky cases easy to spot. With a little luck the code to solve those
        deadlocks will be ready by next merge window"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
        signal: Extend exec_id to 64bits
        pidfd: Use new infrastructure to fix deadlocks in execve
        perf: Use new infrastructure to fix deadlocks in execve
        proc: io_accounting: Use new infrastructure to fix deadlocks in execve
        proc: Use new infrastructure to fix deadlocks in execve
        kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
        kernel: doc: remove outdated comment cred.c
        mm: docs: Fix a comment in process_vm_rw_core
        selftests/ptrace: add test cases for dead-locks
        exec: Fix a deadlock in strace
        exec: Add exec_update_mutex to replace cred_guard_mutex
        exec: Move exec_mmap right after de_thread in flush_old_exec
        exec: Move cleanup of posix timers on exec out of de_thread
        exec: Factor unshare_sighand out of de_thread and call it separately
        exec: Only compute current once in flush_old_exec
        pid: Improve the comment about waiting in zap_pid_ns_processes
        proc: Remove the now unnecessary internal mount of proc
        uml: Create a private mount of proc for mconsole
        uml: Don't consult current to find the proc_mnt in mconsole_proc
        proc: Use a list of inodes to flush from proc
        ...
      d987ca1c
    • M
      include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP · 77d6b909
      Matthew Wilcox (Oracle) committed
      It's even more important to check that we don't have a tail page when
      calling hpage_nr_pages() when THP are disabled.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-4-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77d6b909
    • C
      mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS · bb297bb2
      Christophe Leroy committed
      When CONFIG_HUGETLB_PAGE is set but not CONFIG_HUGETLBFS, the following
      build failure is encountered:
      
        In file included from arch/powerpc/mm/fault.c:33:0:
        include/linux/hugetlb.h: In function 'hstate_inode':
        include/linux/hugetlb.h:477:9: error: implicit declaration of function 'HUGETLBFS_SB' [-Werror=implicit-function-declaration]
          return HUGETLBFS_SB(i->i_sb)->hstate;
                 ^
        include/linux/hugetlb.h:477:30: error: invalid type argument of '->' (have 'int')
          return HUGETLBFS_SB(i->i_sb)->hstate;
                                      ^
      
      Gate hstate_inode() with CONFIG_HUGETLBFS instead of CONFIG_HUGETLB_PAGE.
      
      Fixes: a137e1cc ("hugetlbfs: per mount huge page sizes")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andi Kleen <ak@suse.de>
      Link: http://lkml.kernel.org/r/7e8c3a3c9a587b9cd8a2f146df32a421b961f3a2.1584432148.git.christophe.leroy@c-s.fr
      Link: https://patchwork.ozlabs.org/patch/1255548/#2386036
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb297bb2
    • C
      selftests/vm: fix map_hugetlb length used for testing read and write · cabc30da
      Christophe Leroy committed
      Commit fa7b9a80 ("tools/selftest/vm: allow choosing mem size and page
      size in map_hugetlb") added the possibility to change the size of memory
      mapped for the test, but left the read and write test using the default
      value.  This is unnoticed when mapping a length greater than the default
      one, but segfaults otherwise.
      
      Fix read_bytes() and write_bytes() by giving them the real length.
      
      Also fix the call to munmap().
      
      Fixes: fa7b9a80 ("tools/selftest/vm: allow choosing mem size and page size in map_hugetlb")
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Leonardo Bras <leonardo@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/9a404a13c871c4bd0ba9ede68f69a1225180dd7e.1580978385.git.christophe.leroy@c-s.fr
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cabc30da
    • V
      mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge() · d4af73e3
      Vlastimil Babka committed
      Commit f1e61557 ("mm: pack compound_dtor and compound_order into one
      word in struct page") changed compound_dtor from a pointer to an array
      index in order to pack it.  To check whether a page has the hugetlbfs
      compound_dtor, we can just compare the index directly without fetching the
      function pointer.  Said commit did that with PageHuge() and we can do the
      same with PageHeadHuge() to make the code a bit smaller and faster.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Neha Agarwal <nehaagarwal@google.com>
      Link: http://lkml.kernel.org/r/20200311172440.6988-1-vbabka@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4af73e3
    • M
      mm/hugetlb.c: clean code by removing unnecessary initialization · 353b2de4
      Mateusz Nosek committed
      Previously, the variable 'check_addr' was initialized but never read
      before being reassigned, so the initialization can be removed.
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Link: http://lkml.kernel.org/r/20200303212354.25226-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      353b2de4
    • M
      hugetlb_cgroup: add hugetlb_cgroup reservation docs · 6566704d
      Mina Almasry committed
      Add docs for how to use hugetlb_cgroup reservations, and their behavior.
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-9-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6566704d
    • M
      hugetlb_cgroup: add hugetlb_cgroup reservation tests · 29750f71
      Mina Almasry committed
      The tests use both shared and private mapped hugetlb memory, and monitor
      the hugetlb usage counter as well as the hugetlb reservation counter.
      They test different configurations such as hugetlb memory usage via
      hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
      MAP_POPULATE.
      
      Also add test for hugetlb reservation reparenting, since this is a subtle
      issue.
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sandipan Das <sandipan@linux.ibm.com>	[powerpc64]
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-8-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29750f71
    • M
      hugetlb: support file_region coalescing again · a9b3f867
      Mina Almasry committed
      An earlier patch in this series disabled file_region coalescing in order
      to hang the hugetlb_cgroup uncharge info on the file_region entries.
      
      This patch re-adds support for coalescing of file_region entries.
      Essentially, every time we add an entry, we call a recursive function that
      tries to coalesce the added region with the regions next to it.  The
      worst-case call depth for this function is 3: one call to coalesce with
      the next region, one to coalesce with the previous region, and one to
      reach the base case.
      
      This is an important performance optimization as private mappings add
      their entries page by page, and we could incur big performance costs for
      large mappings with lots of file_region entries in their resv_map.
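The coalescing step described above can be sketched in Python. This is a simplified model, not the kernel's C implementation. It assumes regions are kept sorted and non-overlapping, and that two regions may merge only when they touch and carry the same uncharge info. `Region`, `owner`, `coalesce`, `region_add` as written here, and the returned depth are illustrative names introduced for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Region:
    start: int   # first page index covered
    end: int     # one past the last page index covered
    owner: str   # stand-in for the cgroup uncharge info (assumption)

def coalesce(regions, i, depth=1):
    """Merge regions[i] into a touching neighbour, then retry on the result.

    Returns the deepest recursion level reached.
    """
    rg = regions[i]
    if i > 0:
        prev = regions[i - 1]
        if prev.end == rg.start and prev.owner == rg.owner:
            prev.end = rg.end          # previous region absorbs rg
            del regions[i]
            return coalesce(regions, i - 1, depth + 1)
    if i + 1 < len(regions):
        nxt = regions[i + 1]
        if nxt.start == rg.end and nxt.owner == rg.owner:
            nxt.start = rg.start       # next region absorbs rg
            del regions[i]
            return coalesce(regions, i, depth + 1)
    return depth                       # base case: nothing left to merge

def region_add(regions, start, end, owner):
    """Insert a new region in sorted position, then coalesce around it."""
    i = 0
    while i < len(regions) and regions[i].start < start:
        i += 1
    regions.insert(i, Region(start, end, owner))
    return coalesce(regions, i)

resv = [Region(0, 1, "cg"), Region(2, 3, "cg")]
depth = region_add(resv, 1, 2, "cg")   # fills the gap between the two regions
print(resv, depth)
```

Filling the gap between two existing regions is the worst case in this model: one call merges with the previous region, one with the next, and a third finds nothing left to merge, giving a depth of 3.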
      
      [almasrymina@google.com: fix CONFIG_CGROUP_HUGETLB ifdefs]
        Link: http://lkml.kernel.org/r/20200214204544.231482-1-almasrymina@google.com
      [almasrymina@google.com: remove check_coalesce_bug debug code]
        Link: http://lkml.kernel.org/r/20200219233610.13808-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-7-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9b3f867
    • M
      hugetlb_cgroup: support noreserve mappings · 08cf9faf
      Mina Almasry committed
      Support MAP_NORESERVE accounting as part of the new counter.
      
      For each hugepage allocation, at allocation time we check if there is a
      reservation for this allocation or not.  If there is a reservation for
      this allocation, then this allocation was charged at reservation time, and
      we don't re-account it.  If there is no reservation for this allocation,
      we charge the appropriate hugetlb_cgroup.
      
      The hugetlb_cgroup to uncharge for this allocation is stored in
      page[3].private.  We use new APIs added in an earlier patch to set this
      pointer.
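The accounting rule above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the kernel code. `Cgroup`, `Page`, `rsvd_usage`, `rsvd_cgroup` and the three helper functions are hypothetical names; `rsvd_cgroup` stands in for the pointer the commit stores in page[3].private, and treating a free as the point where an allocation-time charge is dropped is an assumption of the model.

```python
class Cgroup:
    def __init__(self):
        self.rsvd_usage = 0        # pages counted by the reservation counter

class Page:
    def __init__(self):
        self.rsvd_cgroup = None    # whom to uncharge when the page is freed

def reserve(cgroup, reserved, index):
    """Reservation time: charge once and remember that the index is reserved."""
    cgroup.rsvd_usage += 1
    reserved.add(index)

def alloc_huge_page(cgroup, reserved, index):
    """Allocation time: charge only if no reservation already paid for it."""
    page = Page()
    if index not in reserved:
        cgroup.rsvd_usage += 1     # MAP_NORESERVE-style allocation
        page.rsvd_cgroup = cgroup  # record who must be uncharged later
    return page

def free_huge_page(page):
    """Undo an allocation-time charge, if this page carries one."""
    if page.rsvd_cgroup is not None:
        page.rsvd_cgroup.rsvd_usage -= 1
        page.rsvd_cgroup = None

cg = Cgroup()
reserved = set()
reserve(cg, reserved, 0)
p0 = alloc_huge_page(cg, reserved, 0)   # already charged at reservation time
p1 = alloc_huge_page(cg, reserved, 1)   # no reservation, charged now
print(cg.rsvd_usage)
```

In this model only `p1` carries an uncharge pointer, so freeing `p0` leaves the counter untouched while freeing `p1` decrements it.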
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-6-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08cf9faf
    • M
      hugetlb_cgroup: add accounting for shared mappings · 075a61d0
      Mina Almasry committed
      For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
      in the resv_map entries, in file_region->reservation_counter.
      
      After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
      if successful, we pass on the hugetlb_cgroup info to a follow up
      region_add call.  When a file_region entry is added to the resv_map via
      region_add, we put the pointer to that cgroup in
      file_region->reservation_counter.  If charging doesn't succeed, we report
      the error to the caller, so that the kernel fails the reservation.
      
      On region_del, which is when the hugetlb memory is unreserved, we also
      uncharge the file_region->reservation_counter.
      
      [akpm@linux-foundation.org: forward declare struct file_region]
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      075a61d0
    • M
      hugetlb: disable region_add file_region coalescing · 0db9d74e
      Mina Almasry committed
      A follow up patch in this series adds hugetlb cgroup uncharge info to the
      file_region entries in resv->regions.  The cgroup uncharge info may differ
      for different regions, so they can no longer be coalesced at region_add
      time.  So, disable region coalescing in region_add in this patch.
      
      Behavior change:
      
      Say a resv_map exists like this [0->1], [2->3], and [5->6].
      
      Then a region_chg/add call comes in region_chg/add(f=0, t=5).
      
      Old code would generate resv->regions: [0->5], [5->6].
      New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
      [5->6].
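      
      The behavior change above can be modelled with a short sketch.  This is
      a simplified illustration, not the kernel code: the function name
      fill_range_without_coalescing and the tuple representation of
      file_region entries are assumptions made for the example.  It keeps
      existing regions untouched, fills only the uncovered gaps in [f, t),
      and counts how many new regions were needed.

```python
def fill_range_without_coalescing(regions, f, t):
    """Model of adding [f, t) to a reservation map without coalescing.

    `regions` is a sorted list of non-overlapping (from, to) tuples.
    Returns (new_region_list, regions_added). Hypothetical helper, not
    the kernel implementation.
    """
    result = []
    added = 0
    cursor = f  # first offset in [f, t) that is not yet covered
    for start, end in regions:
        # Fill the gap between the cursor and this region, clipped to t.
        gap_end = min(start, t)
        if cursor < gap_end:
            result.append((cursor, gap_end))
            added += 1
        result.append((start, end))  # existing entries are never merged
        cursor = max(cursor, end)
    # Anything left after the last existing region is one more new entry.
    if cursor < t:
        result.append((cursor, t))
        added += 1
    return result, added


regions, added = fill_range_without_coalescing([(0, 1), (2, 3), (5, 6)], 0, 5)
print(regions, added)
```

      For the resv_map in the example, two new entries ([1->2] and [3->5]) are
      created, which is why a single increment or decrement of
      adds_in_progress no longer suffices.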
      
      Special care needs to be taken to handle the resv->adds_in_progress
      variable correctly.  In the past, only 1 region would be added for every
      region_chg and region_add call.  But now, each call may add multiple
      regions, so we can no longer increment adds_in_progress by 1 in
      region_chg, or decrement adds_in_progress by 1 after region_add or
      region_abort.  Instead, region_chg calls add_reservation_in_range() to
      count the number of regions needed and allocates those, and that info is
      passed to region_add and region_abort to decrement adds_in_progress
      correctly.
      
      We've also modified the assumption that region_add after region_chg never
      fails.  region_chg now pre-allocates at least 1 region for region_add.  If
      region_add needs more regions than region_chg has allocated for it, then
      it may fail.
      
      [almasrymina@google.com: fix file_region entry allocations]
        Link: http://lkml.kernel.org/r/20200219012736.20363-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Link: http://lkml.kernel.org/r/20200211213128.73302-4-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0db9d74e
    • M
      hugetlb_cgroup: add reservation accounting for private mappings · e9fe92ae
      Mina Almasry committed
      Normally the pointer to the cgroup to uncharge hangs off the struct page,
      and gets queried when it's time to free the page.  With hugetlb_cgroup
      reservations, this is not possible, because a page can be reserved by
      one task and actually faulted in by another task.
      
      The best place to put the hugetlb_cgroup pointer to uncharge for
      reservations is in the resv_map.  But, because the resv_map has different
      semantics for private and shared mappings, the code path to
      charge/uncharge shared and private mappings is different.  This patch
      implements charging and uncharging for private mappings.
      
      For private mappings, the counter to uncharge is in
      resv_map->reservation_counter.  On initializing the resv_map this is set
      to NULL.  On reservation of a region in a private mapping, the task's
      hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
      resv_map->reservation_counter.
      
      On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
      
      [akpm@linux-foundation.org: forward declare struct resv_map]
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9fe92ae
    • M
      mm/hugetlb_cgroup: fix hugetlb_cgroup migration · 9808895e
      Mina Almasry committed
      Commit c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge
      hugetlb reservations") mistakenly doesn't handle the migration of *both*
      the reservation hugetlb_cgroup and the fault hugetlb_cgroup correctly.
      
      What should happen is that both cgroups should be queried from the old
      page, then both set to NULL on the old page, then both inserted into the
      new page.
      
      The mistake also creates the following warning:
      
      mm/hugetlb_cgroup.c: In function 'hugetlb_cgroup_migrate':
      mm/hugetlb_cgroup.c:777:25: warning: variable 'h_cg' set but not used
      [-Wunused-but-set-variable]
        struct hugetlb_cgroup *h_cg;
                               ^~~~
      
      Solution is to add the missing steps, namely setting the reservation
      hugetlb_cgroup to NULL on the old page, and setting the fault
      hugetlb_cgroup on the new page.
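      
      The intended migration sequence can be sketched as follows.  This is a
      toy model only: the Page class and its fault_cg and rsvd_cg attributes
      are hypothetical stand-ins for the two hugetlb_cgroup pointers kept on
      a huge page, and no locking is modelled.

```python
class Page:
    """Toy stand-in for a huge page carrying two cgroup pointers."""

    def __init__(self, fault_cg=None, rsvd_cg=None):
        self.fault_cg = fault_cg  # cgroup charged at fault time
        self.rsvd_cg = rsvd_cg    # cgroup charged at reservation time


def migrate_cgroups(old, new):
    """Move both cgroup pointers from `old` to `new`."""
    # 1. Query both cgroups from the old page.
    fault_cg, rsvd_cg = old.fault_cg, old.rsvd_cg
    # 2. Clear both on the old page.
    old.fault_cg = None
    old.rsvd_cg = None
    # 3. Install both on the new page.
    new.fault_cg = fault_cg
    new.rsvd_cg = rsvd_cg


old, new = Page(fault_cg="cg_a", rsvd_cg="cg_b"), Page()
migrate_cgroups(old, new)
```

      Skipping either half of step 2 or step 3 leaves one pointer stale or
      lost, which is the class of mistake described above.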
      
      Fixes: c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200218194727.46995-1-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9808895e
    • M
      hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations · 1adc4d41
      Mina Almasry committed
      Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage
      or hugetlb reservation counter.
      
      Adds a new interface to uncharge a hugetlb_cgroup counter via
      hugetlb_cgroup_uncharge_counter.
      
      Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
      hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1adc4d41
    • M
      hugetlb_cgroup: add hugetlb_cgroup reservation counter · cdc2fcfe
      Mina Almasry committed
      These counters will track hugetlb reservations rather than hugetlb memory
      faulted in.  This patch only adds the counter; following patches add the
      charging and uncharging of the counter.
      
      This is patch 1 of a 9 patch series.
      
      Problem:
      
      Currently tasks attempting to reserve more hugetlb memory than is
      available get a failure at mmap/shmget time.  This is thanks to Hugetlbfs
      Reservations [1].  However, if a task attempts to reserve more hugetlb
      memory than its hugetlb_cgroup limit allows, the kernel will allow the
      mmap/shmget call, but will SIGBUS the task when it attempts to fault in
      the excess memory.
      
      We have users hitting their hugetlb_cgroup limits and thus we've been
      looking at this failure mode.  We'd like to improve this behavior such
      that users violating the hugetlb_cgroup limits get an error on mmap/shmget
      time, rather than getting SIGBUS'd when they try to fault the excess
      memory in.  This gives the user an opportunity to fall back more gracefully
      to non-hugetlbfs memory for example.
      
      The underlying problem is that today's hugetlb_cgroup accounting happens
      at hugetlb memory *fault* time, rather than at *reservation* time.  Thus,
      enforcing the hugetlb_cgroup limit only happens at fault time, and the
      offending task gets SIGBUS'd.
      
      Proposed Solution:
      
      A new page counter named
      'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
      slightly different semantics than
      'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':
      
      - While usage_in_bytes tracks all *faulted* hugetlb memory,
        rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
        memory faulted in without a prior reservation.
      
      - If a task attempts to reserve more memory than limit_in_bytes allows,
        the kernel will allow it to do so.  But if a task attempts to reserve
        more memory than rsvd.limit_in_bytes, the kernel will fail this
        reservation.
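      
      The difference between the two counters can be illustrated with a small
      model.  This is a sketch under stated assumptions, not the kernel's
      page_counter code: the class HugetlbCgroupModel, its method names, and
      the choice to return False where the kernel would fail a reservation or
      SIGBUS a task are all made up for the example.

```python
class HugetlbCgroupModel:
    """Toy model of the fault-time counter and the rsvd counter."""

    def __init__(self, limit, rsvd_limit):
        self.limit = limit            # models limit_in_bytes
        self.rsvd_limit = rsvd_limit  # models rsvd.limit_in_bytes
        self.usage = 0                # models usage_in_bytes
        self.rsvd_usage = 0           # models rsvd.usage_in_bytes

    def reserve(self, nbytes):
        """Reservation at mmap/shmget time: only the rsvd counter is checked."""
        if self.rsvd_usage + nbytes > self.rsvd_limit:
            return False  # the reservation fails up front
        self.rsvd_usage += nbytes
        return True

    def fault(self, nbytes, reserved):
        """Fault-time charge; False stands for the SIGBUS case."""
        if self.usage + nbytes > self.limit:
            return False
        if not reserved:
            # Memory faulted in without a prior reservation is also
            # tracked by the rsvd counter.
            if self.rsvd_usage + nbytes > self.rsvd_limit:
                return False
            self.rsvd_usage += nbytes
        self.usage += nbytes
        return True


cg = HugetlbCgroupModel(limit=4, rsvd_limit=2)
print(cg.reserve(3), cg.reserve(2), cg.fault(2, reserved=True))
```

      In this model an oversized request is rejected by reserve() before any
      memory is touched, which is the early failure the series aims for.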
      
      This proposal is implemented in this patch series, with tests to verify
      functionality and show the usage.
      
      Alternatives considered:
      
      1. A new cgroup, instead of only a new page_counter attached to the
         existing hugetlb_cgroup.  Adding a new cgroup seemed like a lot of code
         duplication with hugetlb_cgroup.  Keeping hugetlb related page counters
         under hugetlb_cgroup seemed cleaner as well.
      
      2. Instead of adding a new counter, we considered adding a sysctl that
         modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
         accounting at reservation time rather than fault time.  Adding a new
         page_counter seems better as userspace could, if it wants, choose to
         enforce different cgroups differently: one via limit_in_bytes, and
         another via rsvd.limit_in_bytes.  This could be very useful if you're
         transitioning how hugetlb memory is partitioned on your system one
         cgroup at a time, for example.  Also, someone may find usage for both
         limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
         gives them the option to do so.
      
      Testing:
      - Added tests passing.
      - Used libhugetlbfs for regression testing.
      
      [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdc2fcfe
    • M
      hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race · 87bf91d3
      Mike Kravetz committed
      hugetlbfs page faults can race with truncate and hole punch operations.
      Current code in the page fault path attempts to handle this by 'backing
      out' operations if we encounter the race.  One obvious omission in the
      current code is removing a page newly added to the page cache.  This is
      pretty straightforward to address, but there is a more subtle and
      difficult issue of backing out hugetlb reservations.  To handle this
      correctly, the 'reservation state' before page allocation needs to be
      noted so that it can be properly backed out.  There are four distinct
      possibilities for reservation state: shared/reserved, shared/no-resv,
      private/reserved and private/no-resv.  Backing out a reservation may
      require memory allocation which could fail so that needs to be taken
      into account as well.
      
      Instead of writing the required complicated code for this rare
      occurrence, just eliminate the race.  i_mmap_rwsem is now held in read
      mode for the duration of page fault processing.  Hold i_mmap_rwsem in
      write mode when modifying i_size.  In this way, truncation can not
      proceed when page faults are being processed.  In addition, i_size
      will not change during fault processing so a single check can be made
      to ensure faults are not beyond (proposed) end of file.  Faults can
      still race with hole punch, but that race is handled by existing code
      and the use of hugetlb_fault_mutex.
      
      With this modification, checks for races with truncation in the page
      fault path can be simplified and removed.  remove_inode_hugepages no
      longer needs to take hugetlb_fault_mutex in the case of truncation.
      Comments are expanded to explain reasoning behind locking.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Link: http://lkml.kernel.org/r/20200316205756.146666-3-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87bf91d3
    • M
      hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · c0d0381a
      Mike Kravetz committed
      Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
      
      While discussing the issue with huge_pte_offset [1], I remembered that
      there were more outstanding hugetlb races.  These issues are:
      
      1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
         invalid via a call to huge_pmd_unshare by another thread.
      2) hugetlbfs page faults can race with truncation causing invalid global
         reserve counts and state.
      
      A previous attempt was made to use i_mmap_rwsem in this manner as
      described at [2].  However, those patches were reverted starting with [3]
      due to locking issues.
      
      To effectively use i_mmap_rwsem to address the above issues it needs to be
      held (in read mode) during page fault processing.  However, during fault
      processing we need to lock the page we will be adding.  Lock ordering
      requires we take page lock before i_mmap_rwsem.  Waiting until after
      taking the page lock is too late in the fault process for the
      synchronization we want to do.
      
      To address this lock ordering issue, the following patches change the lock
      ordering for hugetlb pages.  This is not too invasive as hugetlbfs
      processing is done separate from core mm in many places.  However, I don't
      really like this idea.  Much ugliness is contained in the new routine
      hugetlb_page_mapping_lock_write() of patch 1.
      
      The only other way I can think of to address these issues is by catching
      all the races.  After catching a race, cleanup, backout, retry ...  etc,
      as needed.  This can get really ugly, especially for huge page
      reservations.  At one time, I started writing some of the reservation
      backout code for page faults and it got so ugly and complicated I went
      down the path of adding synchronization to avoid the races.  Any other
      suggestions would be welcome.
      
      [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
      [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
      [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
      [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
      [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
      
      This patch (of 2):
      
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with
        the ptep.
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
      
      One problem with this scheme is that it requires taking i_mmap_rwsem
      before taking the page lock during page faults.  This is not the order
      specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
      isolated today.  Therefore, we use this alternative locking order for
      PageHuge() pages.
      
               mapping->i_mmap_rwsem
                 hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
                   page->flags PG_locked (lock_page)
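      
      The ordering above can be expressed as a small checker.  This is purely
      illustrative: the lock names are plain strings, the helper
      follows_hugetlb_lock_order is hypothetical, and nothing here reflects
      how the kernel validates lock ordering.

```python
# Lock order for PageHuge() pages as listed above, outermost first.
HUGETLB_LOCK_ORDER = ("i_mmap_rwsem", "hugetlb_fault_mutex", "page_lock")


def follows_hugetlb_lock_order(acquired):
    """Return True if the locks in `acquired` were taken in the listed order.

    `acquired` is a sequence of lock names in acquisition order; locks may
    be skipped, but none may be taken after a lock that ranks below it.
    """
    ranks = [HUGETLB_LOCK_ORDER.index(name) for name in acquired]
    return all(a < b for a, b in zip(ranks, ranks[1:]))


print(follows_hugetlb_lock_order(["i_mmap_rwsem", "hugetlb_fault_mutex", "page_lock"]))
print(follows_hugetlb_lock_order(["page_lock", "i_mmap_rwsem"]))
```

      The second call models the ordering used by the rest of mm (page lock
      first), which does not satisfy the hugetlb-specific order.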
      
      To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
      introduced to write lock the i_mmap_rwsem associated with a page.
      
      In most cases it is easy to get address_space via vma->vm_file->f_mapping.
      However, in the case of migration or memory errors for anon pages we do
      not have an associated vma.  A new routine _get_hugetlb_page_mapping()
      will use anon_vma to get address_space in these cases.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0d0381a
    • C
      mm/memblock.c: remove redundant assignment to variable max_addr · 49aef717
      Colin Ian King committed
      The variable max_addr is being initialized with a value that is never read
      and it is being updated later with a new value.  The initialization is
      redundant and can be removed.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Link: http://lkml.kernel.org/r/20200228235003.112718-1-colin.king@canonical.com
      Addresses-Coverity: ("Unused value")
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49aef717
    • R
      mm: mempolicy: require at least one nodeid for MPOL_PREFERRED · aa9f7d51
      Randy Dunlap committed
      Using an empty (malformed) nodelist that is not caught during mount option
      parsing leads to a stack-out-of-bounds access.
      
      The option string that was used was: "mpol=prefer:,".  However,
      MPOL_PREFERRED requires a single node number, which is not being provided
      here.
      
      Add a check that 'nodes' is not empty after parsing for MPOL_PREFERRED's
      nodeid.
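      
      The added check can be sketched in a simplified form.  This is not the
      kernel's mount option parser: the functions parse_nodelist and
      parse_mpol_option are hypothetical, and the lenient handling of empty
      list items is an assumption chosen so that "prefer:," parses to an
      empty node set, as in the report.

```python
def parse_nodelist(text):
    """Parse a list such as "0-2,5" into a set of node ids.

    Empty items are skipped, so "," yields an empty set (assumed leniency).
    """
    nodes = set()
    for token in text.split(","):
        if not token:
            continue
        lo, _, hi = token.partition("-")
        first = int(lo)
        last = int(hi) if hi else first
        nodes.update(range(first, last + 1))
    return nodes


def parse_mpol_option(option):
    """Parse "<mode>:<nodelist>" and reject an empty list for "prefer"."""
    mode, _, nodelist = option.partition(":")
    nodes = parse_nodelist(nodelist)
    if mode == "prefer" and not nodes:
        # The check described above: a preferred policy needs a node id.
        raise ValueError("prefer requires a node id")
    return mode, nodes


print(parse_mpol_option("prefer:1"))
```

      With the check in place, the malformed string "prefer:," is rejected at
      parse time instead of reaching code that assumes a node is present.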
      
      Fixes: 095f1fc4 ("mempolicy: rework shmem mpol parsing and display")
      Reported-by: Entropy Moe <3ntr0py1337@gmail.com>
      Reported-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Link: http://lkml.kernel.org/r/89526377-7eb6-b662-e1d8-4430928abde9@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa9f7d51
    • Y
      mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk() · d888fb2b
      Yang Shi committed
      queue_pages_test_walk() already uses VM_BUG_ON().  Use VM_BUG_ON_VMA()
      instead so that more debug information is dumped to help debugging.
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Li Xinhai" <lixinhai.lxh@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/1579068565-110432-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d888fb2b
    • L
      mm/mempolicy: check hugepage migration is supported by arch in vma_migratable() · 20ca87f2
      Li Xinhai committed
      vma_migratable() is called to check whether pages in a vma can be migrated
      before taking further action.  Currently it is used in the following code
      paths:
      
      - task_numa_work
      - mbind
      - move_pages
      
      For hugetlb mapping, whether vma is migratable or not is determined by:
      - CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
      - arch_hugetlb_migration_supported
      
      Issue: current code only checks CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION,
      and no code should use it directly.  (Note that the current code in
      vma_migratable() doesn't cause a failure or bug, because
      unmap_and_move_huge_page() will catch an unsupported hugepage and handle
      it properly.)
      
      This patch checks both factors via hugepage_migration_supported() to
      improve code logic and robustness.  It enables an early bail-out from the
      hugepage migration procedure, but because all architectures that
      currently support hugepage migration support all page sizes, we would
      not see a performance gain with this patch applied.
      
      vma_migratable() is moved to mm/mempolicy.c, because the circular
      reference between mempolicy.h and hugetlb.h makes defining it as an
      inline function infeasible.
      Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Link: http://lkml.kernel.org/r/1579786179-30633-1-git-send-email-lixinhai.lxh@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20ca87f2
    • L
      mm/mempolicy: support MPOL_MF_STRICT for huge page mapping · dcf17635
      Li Xinhai committed
      MPOL_MF_STRICT is used in mbind() for two purposes:
      
      (1) MPOL_MF_STRICT is set alone, without MPOL_MF_MOVE or
          MPOL_MF_MOVE_ALL, to check whether there is a misplaced page and
          return -EIO;
      
      (2) MPOL_MF_STRICT is set with MPOL_MF_MOVE or MPOL_MF_MOVE_ALL, to
          check whether there is a misplaced page that failed to be isolated,
          or a page that was isolated but failed to move, and return -EIO.
      
      For non-hugepage mappings, (1) and (2) are implemented as expected.  For
      hugepage mappings, (1) is not implemented, and in (2) the part about
      reporting -EIO on failure to isolate is not implemented.
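      
      The two cases can be summarised in a small decision function.  This is
      a sketch of the semantics described above, not of the kernel's page
      walk: the function strict_check_result, the tuple layout of each page
      record, and the numeric flag and errno values are illustrative
      assumptions.

```python
# Illustrative constants; treat the exact values as placeholders.
MPOL_MF_STRICT = 1 << 0
MPOL_MF_MOVE = 1 << 1
MPOL_MF_MOVE_ALL = 1 << 2
EIO = 5


def strict_check_result(flags, pages):
    """Return 0 or -EIO for a range, following cases (1) and (2).

    `pages` is a list of (misplaced, isolate_ok, move_ok) booleans, one
    tuple per page in the range (hypothetical representation).
    """
    strict = bool(flags & MPOL_MF_STRICT)
    move = bool(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
    if not strict:
        return 0
    for misplaced, isolate_ok, move_ok in pages:
        if not misplaced:
            continue
        if not move:
            return -EIO  # case (1): any misplaced page is an error
        if not (isolate_ok and move_ok):
            return -EIO  # case (2): isolation or the move itself failed
    return 0


print(strict_check_result(MPOL_MF_STRICT, [(True, True, True)]))
```

      With the patch applied, hugepage mappings are meant to follow the same
      decision table as all other mappings.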
      
      This patch implements the missing parts for hugepage mappings.  Benefits
      with it applied:
      
      - User space can apply the same code logic to handle mbind() on hugepage
        and non-hugepage mappings;
      
      - MPOL_MF_STRICT alone can be used reliably to check whether there is a
        misplaced page when binding a policy to an address range, especially
        for an address range that contains both hugepage and non-hugepage
        mappings.
      
      Analysis of potential impact to existing users:
      
      - If MPOL_MF_STRICT alone was previously used, hugetlb pages not
        following the memory policy would not cause an EIO error.  After this
        change, hugetlb pages are treated like all other pages.  If
        MPOL_MF_STRICT alone is used and hugetlb pages do not follow memory
        policy an EIO error will be returned.
      
      - For users who use MPOL_MF_STRICT with MPOL_MF_MOVE or
        MPOL_MF_MOVE_ALL, the semantics when some pages cannot be moved are
        not changed by this patch, because failure to isolate and failure to
        move have the same effect for users, so their existing code will not
        be impacted.
      
      In the mbind man page, the note about 'MPOL_MF_STRICT is ignored on huge page
      mappings' can be removed after this patch is applied.
      
      Mike:
      
      : The current behavior with MPOL_MF_STRICT and hugetlb pages is inconsistent
      : and does not match documentation (as described above).  The special
      : behavior for hugetlb pages ideally should have been removed when hugetlb
      : page migration was introduced.  It is unlikely that anyone relies on
      : today's inconsistent behavior, and removing one more case of special
      : handling for hugetlb pages is a good thing.
      Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-man <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1581559627-6206-1-git-send-email-lixinhai.lxh@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcf17635
    • M
      mm/compaction.c: clean code by removing unnecessary assignment · 250046e7
      Mateusz Nosek committed
      Previously 0 was assigned to variable 'last_migrated_pfn'.  But the
      variable is not read after that, so the assignment can be removed.
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: http://lkml.kernel.org/r/20200318174509.15021-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      250046e7
    • S
      mm/compaction: Disable compact_unevictable_allowed on RT · 6923aa0d
      Sebastian Andrzej Siewior committed
      Since commit 5bbe3547 ("mm: allow compaction of unevictable pages")
      it is allowed to examine mlocked pages and compact them by default.  On
      -RT even minor pagefaults are problematic because it may take a few 100us
      to resolve them and until then the task is blocked.
      
      Make compact_unevictable_allowed = 0 default and issue a warning on RT if
      it is changed.
      
      [bigeasy@linutronix.de: v5]
        Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
        Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
      Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6923aa0d
    • S
      mm/compaction: really limit compact_unevictable_allowed to 0 and 1 · 964b692d
      Sebastian Andrzej Siewior committed
      The proc file `compact_unevictable_allowed' should allow 0 and 1 only.  The
      `extra*' attributes have been set properly, but without
      proc_dointvec_minmax() as the `proc_handler' the limit will not be
      enforced.
      
      Use proc_dointvec_minmax() as the `proc_handler' to enforce the valid
      specified range.
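      
      The effect of the handler choice can be modelled in a few lines.  This
      is only an analogy to the sysctl table mechanism: the dictionary layout
      and the two handler functions below are hypothetical and merely borrow
      the kernel's field names to show that bounds stored in a table do
      nothing unless the handler consults them.

```python
def handler_plain(table, value):
    """Stores any integer; the bounds in the table are never looked at."""
    table["data"] = value


def handler_minmax(table, value):
    """Stores the value only if it lies within [extra1, extra2]."""
    if not table["extra1"] <= value <= table["extra2"]:
        raise ValueError("value out of range")
    table["data"] = value


# Toy table entry: bounds are present in both cases.
entry = {"data": 1, "extra1": 0, "extra2": 1}

handler_plain(entry, 7)       # accepted despite the bounds
print(entry["data"])

try:
    handler_minmax(entry, 7)  # rejected by the range-checking handler
except ValueError as err:
    print("rejected:", err)
```

      The entry keeps the out-of-range value written through the plain
      handler, while the range-checking handler refuses the same write.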
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: http://lkml.kernel.org/r/20200303202054.gsosv7fsx2ma3cic@linutronix.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      964b692d