1. 01 Feb, 2018 - 2 commits
  2. 30 Nov, 2017 - 1 commit
    • fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate() · 72639e6d
      Authored by Nadav Amit
      hugetlbfs_fallocate() currently performs put_page() before unlock_page().
      This opens a small time window, from the time the page is added to the
      page cache until it is unlocked, in which the page might be removed from
      the page cache by another core.  If the page is removed during this
      window, it can cause memory corruption, as the wrong page will be
      unlocked.
      
      It is arguable whether this scenario can happen in a real system, and
      there are several mitigating factors.  The issue was found by code
      inspection (actually grep), and not by actually triggering the flow.
      Yet, since putting the page before unlocking is incorrect it should be
      fixed, if only to prevent future breakage or someone copy-pasting this
      code.
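
      A minimal sketch of the corrected ordering (the helper name below is
      hypothetical; the real change is a two-line swap inside
      hugetlbfs_fallocate()):

      #include <linux/mm.h>
      #include <linux/pagemap.h>

      /*
       * Keep our reference until after the page has been unlocked, so the
       * page we unlock is guaranteed to still be the page we added to the
       * page cache.
       */
      static void example_finish_page(struct page *page)
      {
              /* old (buggy) order: put_page(page); unlock_page(page); */
              unlock_page(page);      /* drop the page lock first */
              put_page(page);         /* then drop our reference  */
      }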
      
      Mike said:
       "I am of the opinion that this does not need to be sent to stable.
        Although the ordering in the current code is incorrect, there is no way
        for this to be a problem with current locking. In addition, I verified
        that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
        for hugetlbfs and other filesystems is addressed in 3a77d214 ("mm:
        fadvise: avoid fadvise for fs without backing device")"
      
      Link: http://lkml.kernel.org/r/20170826191124.51642-1-namit@vmware.com
      Fixes: 70c3547e ("hugetlbfs: add hugetlbfs_fallocate()")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72639e6d
  3. 16 Nov, 2017 - 2 commits
  4. 03 Nov, 2017 - 1 commit
  5. 09 Sep, 2017 - 2 commits
    • lib/interval_tree: fast overlap detection · f808c13f
      Authored by Davidlohr Bueso
      Allow interval trees to quickly check for overlaps to avoid unnecessary
      tree lookups in interval_tree_iter_first().
      
      As of this patch, all interval tree flavors will require using an
      'rb_root_cached' such that we can have the leftmost node easily
      available.  While most users will make use of this feature, those with
      special functions (in addition to the generic insert, delete, search
      calls) will avoid using the cached option as they can do funky things
      with insertions -- for example, vma_interval_tree_insert_after().
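
      A simplified sketch of the fast path this enables (the helper below is
      illustrative, not the upstream code; the real check lives inside the
      INTERVAL_TREE_DEFINE()-generated iter_first):

      #include <linux/interval_tree.h>
      #include <linux/rbtree.h>

      /*
       * The cached leftmost node has the smallest start, so if even that
       * start lies beyond the end of the queried range, nothing can
       * overlap.  This is only a cheap pre-check before the full walk.
       */
      static bool example_may_overlap(struct rb_root_cached *root,
                                      unsigned long last)
      {
              struct interval_tree_node *leftmost;

              if (!root->rb_root.rb_node)     /* empty tree */
                      return false;

              leftmost = rb_entry(rb_first_cached(root),
                                  struct interval_tree_node, rb);
              return leftmost->start <= last;
      }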
      
      [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
        Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Christian Benvenuti <benve@cisco.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f808c13f
    • mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY · 2916ecc0
      Authored by Jérôme Glisse
      Introduce a new migration mode that allows offloading the copy to a
      device DMA engine.  This changes the workflow of migration, and not
      every address_space migratepage callback can support it.

      This is intended to be used by migrate_vma(), which itself is used for
      things like HMM (see include/linux/hmm.h).

      No additional per-filesystem migratepage testing is needed.
      MIGRATE_SYNC_NO_COPY is disabled in all problematic migratepage()
      callbacks, and a comment was added to each explaining why (part of this
      patch).  Any callback that wishes to support this new mode needs to be
      aware of how its migration flow differs from the other modes.
      
      Some of these callbacks do extra locking while copying (aio, zsmalloc,
      balloon, ...), and for DMA to be effective you want to copy multiple
      pages in one DMA operation.  But in the problematic cases you cannot
      easily hold the extra lock across multiple calls to this callback.
      
      Usual flow is:
      
      For each page {
       1 - lock page
       2 - call migratepage() callback
       3 - (extra locking in some migratepage() callback)
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
       5 - copy page
       6 - (unlock any extra lock of migratepage() callback)
       7 - return from migratepage() callback
       8 - unlock page
      }
      
      The new mode MIGRATE_SYNC_NO_COPY:
       1 - lock multiple pages
      For each page {
       2 - call migratepage() callback
       3 - abort in all problematic migratepage() callback
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
      } // finished all calls to migratepage() callback
       5 - DMA copy multiple pages
       6 - unlock all the pages
      
      To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
      new callback migratepages() (for instance) that deals with multiple
      pages in one transaction.
      
      Because the problematic cases are not important for current usage, I did
      not want to complicate this patchset even further for no good reason.
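
      A sketch of the pattern used in the "problematic" callbacks (the
      function and filesystem here are illustrative, not a specific driver):

      #include <linux/fs.h>
      #include <linux/migrate.h>

      /*
       * A migratepage() callback that copies data itself under extra locks
       * cannot defer the copy to a DMA engine, so it refuses the new mode
       * and lets the caller fall back to a copying migration.
       */
      static int example_migratepage(struct address_space *mapping,
                                     struct page *newpage, struct page *page,
                                     enum migrate_mode mode)
      {
              if (mode == MIGRATE_SYNC_NO_COPY)
                      return -EINVAL;

              return migrate_page(mapping, newpage, page, mode);
      }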
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2916ecc0
  6. 07 Sep, 2017 - 3 commits
  7. 11 Jul, 2017 - 1 commit
  8. 06 Jul, 2017 - 1 commit
    • hugetlbfs: Implement show_options · 4a25220d
      Authored by David Howells
      Implement the show_options superblock op for hugetlbfs as part of a bid to
      get rid of s_options and generic_show_options() to make it easier to
      implement a context-based mount where the mount options can be passed
      individually over a file descriptor.
      
      Note that the uid and gid should possibly be displayed relative to the
      viewer's user namespace.
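
      A rough sketch of the shape of such an implementation (field and type
      names are illustrative, not the actual hugetlbfs_sb_info; the real
      version also prints uid, gid, page size and the size limits):

      #include <linux/fs.h>
      #include <linux/seq_file.h>

      struct example_sb_info {
              umode_t mode;
              unsigned long pagesize;
      };

      /* print only options that differ from the defaults */
      static int example_show_options(struct seq_file *m, struct dentry *root)
      {
              struct example_sb_info *sbinfo = root->d_sb->s_fs_info;

              if (sbinfo->mode != 0755)
                      seq_printf(m, ",mode=%o", (unsigned int)sbinfo->mode);
              seq_printf(m, ",pagesize=%lu", sbinfo->pagesize);
              return 0;
      }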
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4a25220d
  9. 19 Jun, 2017 - 1 commit
    • mm: larger stack guard gap, between vmas · 1be7107f
      Authored by Hugh Dickins
      The stack guard page is a useful feature to reduce the risk of stack
      smashing into a different mapping. We have been using a single-page gap,
      which is sufficient to prevent the stack being adjacent to a different
      mapping. But this seems to be insufficient in light of stack usage in
      userspace. E.g. glibc uses alloca() allocations as large as 64kB in many
      commonly used functions. Others use constructs like gid_t
      buffer[NGROUPS_MAX], which is 256kB, or stack strings with MAX_ARG_STRLEN.
      
      This is especially dangerous for suid binaries with the default
      unlimited stack size limit, because those applications can be tricked
      into consuming a large portion of the stack, and a single glibc call
      could then jump over the guard page. These attacks are not theoretical,
      unfortunately.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size,
      because systems with larger base pages might cap stack allocations in
      PAGE_SIZE units), which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somewhat inherent, but it should reduce the attack surface a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and the vma tree's subtree_gap support for that.
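
      The helper introduced for this looks roughly like the following (from
      include/linux/mm.h; the VM_GROWSUP side has a matching vm_end_gap()):

      /* stack_guard_gap is the new, command-line tunable gap in bytes */
      static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
      {
              unsigned long vm_start = vma->vm_start;

              if (vma->vm_flags & VM_GROWSDOWN) {
                      vm_start -= stack_guard_gap;
                      if (vm_start > vma->vm_start)   /* underflow check */
                              vm_start = 0;
              }
              return vm_start;
      }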
      Original-patch-by: Oleg Nesterov <oleg@redhat.com>
      Original-patch-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be7107f
  10. 14 Apr, 2017 - 1 commit
    • hugetlbfs: fix offset overflow in hugetlbfs mmap · 045c7a3f
      Authored by Mike Kravetz
      If mmap() maps a file, it can be passed an offset into the file at which
      the mapping is to start.  The offset could be a negative value when
      represented as a loff_t.  The offset plus length will be used to update
      the file size (i_size) which is also a loff_t.
      
      Validate the value of offset and offset + length to make sure they do
      not overflow and appear as negative.
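
      A condensed sketch of the check (simplified from the actual
      hugetlbfs_file_mmap() fix):

      #include <linux/fs.h>
      #include <linux/mm.h>

      /*
       * Compute the mapping length and the end offset as loff_t and reject
       * the mmap() if the addition wrapped around and went negative.
       */
      static int example_check_offset(struct vm_area_struct *vma)
      {
              loff_t vma_len = (loff_t)(vma->vm_end - vma->vm_start);
              loff_t len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

              if (len < vma_len)      /* overflow */
                      return -EINVAL;
              return 0;
      }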
      
      Found by syzkaller with commit ff8c0c53 ("mm/hugetlb.c: don't call
      region_abort if region_chg fails") applied.  Prior to this commit, the
      overflow would still occur but we would luckily return ENOMEM.
      
      To reproduce:
      
         mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);
      
      Resulted in,
      
        kernel BUG at mm/hugetlb.c:742!
        Call Trace:
         hugetlbfs_evict_inode+0x80/0xa0
         evict+0x24a/0x620
         iput+0x48f/0x8c0
         dentry_unlink_inode+0x31f/0x4d0
         __dentry_kill+0x292/0x5e0
         dput+0x730/0x830
         __fput+0x438/0x720
         ____fput+0x1a/0x20
         task_work_run+0xfe/0x180
         exit_to_usermode_loop+0x133/0x150
         syscall_return_slowpath+0x184/0x1c0
         entry_SYSCALL_64_fastpath+0xab/0xad
      
      Fixes: ff8c0c53 ("mm/hugetlb.c: don't call region_abort if region_chg fails")
      Link: http://lkml.kernel.org/r/1491951118-30678-1-git-send-email-mike.kravetz@oracle.com
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      045c7a3f
  11. 01 Apr, 2017 - 1 commit
    • hugetlbfs: initialize shared policy as part of inode allocation · 4742a35d
      Authored by Mike Kravetz
      Any time after inode allocation, destroy_inode can be called.  The
      hugetlbfs inode contains a shared_policy structure, and
      mpol_free_shared_policy is unconditionally called as part of
      hugetlbfs_destroy_inode.  Initialize the policy as part of inode
      allocation so that any quick (error path) calls to destroy_inode will be
      handed an initialized policy.
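
      A sketch of the pattern (type and function names are illustrative; the
      real code lives in hugetlbfs_alloc_inode(), and other inode setup is
      elided):

      #include <linux/fs.h>
      #include <linux/mempolicy.h>
      #include <linux/slab.h>

      struct example_inode_info {
              struct shared_policy policy;
              struct inode vfs_inode;
      };

      static struct inode *example_alloc_inode(struct super_block *sb)
      {
              struct example_inode_info *p = kmalloc(sizeof(*p), GFP_KERNEL);

              if (!p)
                      return NULL;
              /*
               * Initialize here, not later in get_inode(), so destroy_inode
               * can always call mpol_free_shared_policy() safely.
               */
              mpol_shared_policy_init(&p->policy, NULL);
              return &p->vfs_inode;
      }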
      
      syzkaller fuzzer found this bug, that resulted in the following:
      
          BUG: KASAN: user-memory-access in atomic_inc
          include/asm-generic/atomic-instrumented.h:87 [inline] at addr
          000000131730bd7a
          BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
          kernel/locking/lockdep.c:3239 at addr 000000131730bd7a
          Write of size 4 by task syz-executor6/14086
          CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Call Trace:
           atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
           __lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
           lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
           __raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
           _raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
           mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
           hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
           alloc_inode+0x10d/0x180 fs/inode.c:216
           new_inode_pseudo+0x69/0x190 fs/inode.c:889
           new_inode+0x1c/0x40 fs/inode.c:918
           hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
           hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
           newseg+0x422/0xd30 ipc/shm.c:575
           ipcget_new ipc/util.c:285 [inline]
           ipcget+0x21e/0x580 ipc/util.c:639
           SYSC_shmget ipc/shm.c:673 [inline]
           SyS_shmget+0x158/0x230 ipc/shm.c:657
           entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      Analysis provided by Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      
      Link: http://lkml.kernel.org/r/1490477850-7944-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4742a35d
  12. 02 Mar, 2017 - 1 commit
  13. 25 Dec, 2016 - 1 commit
  14. 08 Oct, 2016 - 1 commit
  15. 28 Sep, 2016 - 1 commit
  16. 27 Sep, 2016 - 2 commits
  17. 22 Sep, 2016 - 1 commit
  18. 05 Apr, 2016 - 2 commits
    • mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Authored by Kirill A. Shutemov
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea1754a0
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Authored by Kirill A. Shutemov
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to implement
      the page cache with bigger chunks than PAGE_SIZE.

      This promise never materialized, and it is unlikely it ever will.

      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it is a constant source of confusion about whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.

      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  19. 23 Jan, 2016 - 1 commit
    • wrappers for ->i_mutex access · 5955102c
      Authored by Al Viro
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become a rwsem, with ->lookup() done with it held
      only shared.
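
      The wrappers themselves are essentially one-liners; a couple of them,
      as added to include/linux/fs.h at the time:

      static inline void inode_lock(struct inode *inode)
      {
              mutex_lock(&inode->i_mutex);
      }

      static inline void inode_unlock(struct inode *inode)
      {
              mutex_unlock(&inode->i_mutex);
      }

      static inline int inode_is_locked(struct inode *inode)
      {
              return mutex_is_locked(&inode->i_mutex);
      }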
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      5955102c
  20. 16 Jan, 2016 - 3 commits
    • mm/hugetlbfs: unmap pages if page fault raced with hole punch · 4aae8d1c
      Authored by Mike Kravetz
      Page faults can race with fallocate hole punch.  If a page fault happens
      between the unmap and remove operations, the page is not removed and
      remains within the hole.  This is not the desired behavior.  The race is
      difficult to detect in user level code as even in the non-race case, a
      page within the hole could be faulted back in before fallocate returns.
      If userfaultfd is expanded to support hugetlbfs in the future, this race
      will be easier to observe.
      
      If this race is detected and a page is mapped, the remove operation
      (remove_inode_hugepages) will unmap the page before removing.  The unmap
      within remove_inode_hugepages occurs with the hugetlb_fault_mutex held
      so that no other faults will be processed until the page is removed.
      
      The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
      remove_inode_hugepages to satisfy the new reference.
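
      A condensed sketch of the added handling (the wrapper function is
      hypothetical; the checks sit inline in remove_inode_hugepages(), and
      hugetlb_vmdelete_list()/remove_huge_page() are that file's helpers):

      #include <linux/fs.h>
      #include <linux/hugetlb.h>
      #include <linux/mm.h>

      /*
       * Called with the hugetlb_fault_mutex for this index held, so no new
       * fault can map the page while we unmap and then remove it.
       */
      static void example_unmap_then_remove(struct address_space *mapping,
                                            struct hstate *h,
                                            struct page *page, pgoff_t index)
      {
              if (unlikely(page_mapped(page))) {
                      i_mmap_lock_write(mapping);
                      hugetlb_vmdelete_list(&mapping->i_mmap,
                                      index * pages_per_huge_page(h),
                                      (index + 1) * pages_per_huge_page(h));
                      i_mmap_unlock_write(mapping);
              }
              remove_huge_page(page);
      }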
      
      [akpm@linux-foundation.org: move hugetlb_vmdelete_list()]
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4aae8d1c
    • fs/hugetlbfs/inode.c: fix bugs in hugetlb_vmtruncate_list() · 9aacdd35
      Authored by Mike Kravetz
      Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine.  The
      argument end is of type pgoff_t.  It was being converted to a vaddr
      offset and passed to unmap_hugepage_range.  However, end was also being
      used as an argument to the vma_interval_tree_foreach controlling loop.
      In addition, the conversion of end to vaddr offset was incorrect.
      
      hugetlb_vmtruncate_list is called as part of a file truncate or
      fallocate hole punch operation.
      
      When truncating a hugetlbfs file, this bug could prevent some pages from
      being unmapped.  This is possible if there are multiple vmas mapping the
      file, and there is a sufficiently sized hole between the mappings.  The
      size of the hole between two vmas (A,B) must be such that the starting
      virtual address of B is greater than (ending virtual address of A <<
      PAGE_SHIFT).  In this case, the pages in B would not be unmapped.  If
      pages are not properly unmapped during truncate, the following BUG is
      hit:
      
      	kernel BUG at fs/hugetlbfs/inode.c:428!
      
      In the fallocate hole punch case, this bug could prevent pages from
      being unmapped as in the truncate case.  However, for hole punch the
      result is that unmapped pages will not be removed during the operation.
      For hole punch, it is also possible that more pages than desired will be
      unmapped.  This unnecessary unmapping will cause page faults to
      reestablish the mappings on subsequent page access.
      
      Fixes: 1bfad99a ("hugetlbfs: hugetlb_vmtruncate_list() needs to take a range")
      Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[4.3]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9aacdd35
    • mm: fix locking order in mm_take_all_locks() · 88f306b6
      Authored by Kirill A. Shutemov
      Dmitry Vyukov has reported[1] a possible deadlock (triggered by his
      syzkaller fuzzer):
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&hugetlbfs_i_mmap_rwsem_key);
                                     lock(&mapping->i_mmap_rwsem);
                                     lock(&hugetlbfs_i_mmap_rwsem_key);
        lock(&mapping->i_mmap_rwsem);
      
      Both traces point to mm_take_all_locks() as the source of the problem.
      It doesn't take care of the ordering of hugetlbfs_i_mmap_rwsem_key (aka
      mapping->i_mmap_rwsem for hugetlb mappings) vs. i_mmap_rwsem.
      
      huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key,
      and the allocator can take i_mmap_rwsem if it hits reclaim.  So we need
      to take i_mmap_rwsem for all hugetlb VMAs before taking i_mmap_rwsem for
      the rest of the VMAs.
      
      The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.
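
      A condensed sketch of the resulting ordering in mm_take_all_locks()
      (vm_lock_mapping() is that file's internal helper; error and signal
      handling are elided):

      #include <linux/hugetlb_inline.h>
      #include <linux/mm.h>

      static void example_lock_file_mappings(struct mm_struct *mm)
      {
              struct vm_area_struct *vma;

              /* pass 1: hugetlbfs-backed mappings first */
              for (vma = mm->mmap; vma; vma = vma->vm_next)
                      if (vma->vm_file && vma->vm_file->f_mapping &&
                          is_vm_hugetlb_page(vma))
                              vm_lock_mapping(mm, vma->vm_file->f_mapping);

              /* pass 2: all the remaining file mappings */
              for (vma = mm->mmap; vma; vma = vma->vm_next)
                      if (vma->vm_file && vma->vm_file->f_mapping &&
                          !is_vm_hugetlb_page(vma))
                              vm_lock_mapping(mm, vma->vm_file->f_mapping);
      }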
      
      [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f306b6
  21. 15 Jan, 2016 - 3 commits
    • hugetlb: make mm and fs code explicitly non-modular · 3e89e1c5
      Authored by Paul Gortmaker
      The Kconfig currently controlling compilation of this code is:
      
      config HUGETLBFS
              bool "HugeTLB file system support"
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Let's remove the modular code that is essentially orphaned, so that when
      reading the driver there is no doubt it is builtin-only.
      
      Since module_init translates to device_initcall in the non-modular case,
      the init ordering gets moved to earlier levels when we use the more
      appropriate initcalls here.
      
      Originally I had the fs part and the mm part as separate commits, just
      by happenstance of the nature of how I detected these non-modular use
      cases.  But that can possibly introduce regressions if the patch merge
      ordering puts the fs part 1st -- as the 0-day testing reported a splat
      at mount time.
      
      Investigating with "initcall_debug" showed that the delta was
      init_hugetlbfs_fs being called _before_ hugetlb_init instead of after.  So
      both the fs change and the mm change are here together.
      
      In addition, it worked before due to luck of link order, since they were
      both in the same initcall category.  So we now have the fs part using
      fs_initcall, and the mm part using subsys_initcall, which puts it one
      bucket earlier.  It now passes the basic sanity test that failed in
      earlier 0-day testing.
      
      We delete the MODULE_LICENSE tag and capture that information at the top
      of the file alongside author comments, etc.
      
      We don't replace module.h with init.h since the file already has that.
      Also note that MODULE_ALIAS is a no-op for non-modular code.
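
      Sketched, the initcall changes described above boil down to (function
      bodies elided):

      #include <linux/init.h>

      /* fs/hugetlbfs/inode.c -- was: module_init(init_hugetlbfs_fs); */
      fs_initcall(init_hugetlbfs_fs);

      /* mm/hugetlb.c -- was: module_init(hugetlb_init);
       * subsys_initcall runs one level earlier than fs_initcall, so
       * hugetlb_init still precedes init_hugetlbfs_fs. */
      subsys_initcall(hugetlb_init);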
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Reported-by: kernel test robot <ying.huang@linux.intel.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e89e1c5
    • mm/mempolicy.c: convert the shared_policy lock to a rwlock · 4a8c7bb5
      Authored by Nathan Zimmer
      When running the SPECint_rate gcc benchmark on some very large boxes, it
      was noticed that the system was spending lots of time in
      mpol_shared_policy_lookup().  The gamess benchmark can also show it, and
      it is what I mostly used to chase down the issue, since I found its
      setup easier.
      
      To be clear the binaries were on tmpfs because of disk I/O requirements.
      We then used text replication to avoid icache misses and having all the
      copies banging on the memory where the instruction code resides.  This
      results in us hitting a bottleneck in mpol_shared_policy_lookup() since
      lookup is serialised by the shared_policy lock.
      
      I have only reproduced this on very large (3k+ cores) boxes.  The
      problem starts showing up at just a few hundred ranks getting worse
      until it threatens to livelock once it gets large enough.  For example
      on the gamess benchmark at 128 ranks this area consumes only ~1% of
      time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
      90%.
      
      To alleviate the contention in this area I converted the spinlock to an
      rwlock.  This allows a large number of lookups to happen simultaneously.
      The results were quite good, reducing this consumption at max ranks to
      around 2%.
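
      The shape of the conversion, condensed (the struct below stands in for
      struct shared_policy and the tree walk is elided):

      #include <linux/rbtree.h>
      #include <linux/spinlock.h>

      struct example_shared_policy {
              struct rb_root root;
              rwlock_t lock;                  /* was: spinlock_t lock; */
      };

      /* lookups can now run concurrently; only insert/delete take the
       * lock exclusively with write_lock() */
      static struct rb_node *example_lookup(struct example_shared_policy *sp)
      {
              struct rb_node *n;

              read_lock(&sp->lock);           /* was: spin_lock() */
              n = sp->root.rb_node;           /* ...tree walk elided... */
              read_unlock(&sp->lock);
              return n;
      }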
      
      [akpm@linux-foundation.org: tidy up code comments]
      Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a8c7bb5
    • kmemcg: account certain kmem allocations to memcg · 5d097056
      Authored by Vladimir Davydov
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to the "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
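
      The per-cache change is a one-flag addition; schematically (names are
      illustrative, mirroring what the patch does for each inode cache,
      including the hugetlbfs one):

      #include <linux/errno.h>
      #include <linux/init.h>
      #include <linux/slab.h>

      struct example_inode_info {
              unsigned long data;
      };

      static struct kmem_cache *example_inode_cachep;

      static int __init example_cache_init(void)
      {
              /* SLAB_ACCOUNT: charge objects from this cache to the
               * allocating task's memcg */
              example_inode_cachep = kmem_cache_create("example_inode_cache",
                                      sizeof(struct example_inode_info),
                                      0, SLAB_ACCOUNT, NULL);
              return example_inode_cachep ? 0 : -ENOMEM;
      }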
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d097056
  22. 09 Dec, 2015 - 1 commit
    • don't put symlink bodies in pagecache into highmem · 21fc61c7
      Authored by Al Viro
      kmap() in page_follow_link_light() needed to go - allowing an arbitrary
      number of kmaps to be held for long is a great way to deadlock the
      system.

      A new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlink inodes; done for all in-tree cases.  page_follow_link_light() is
      instrumented to yell about anything missed.
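
      Typical use in a filesystem that keeps symlink bodies in the page cache
      (sketched; the setup function is illustrative, and the helper itself is
      essentially a mapping_set_gfp_mask(inode->i_mapping, GFP_USER) wrapper):

      #include <linux/fs.h>
      #include <linux/pagemap.h>

      static void example_setup_symlink(struct inode *inode)
      {
              inode->i_op = &page_symlink_inode_operations;
              inode_nohighmem(inode); /* symlink body stays out of highmem */
      }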
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      21fc61c7
  23. 21 Nov, 2015 - 1 commit
    • mm/hugetlbfs: fix bugs in fallocate hole punch of areas with holes · 1817889e
      Authored by Mike Kravetz
      Hugh Dickins pointed out problems with the new hugetlbfs fallocate hole
      punch code.  These problems are in the routine remove_inode_hugepages and
      mostly occur in the case where there are holes in the range of pages to be
      removed.  These holes could be the result of a previous hole punch or
      simply sparse allocation.  The current code could access pages outside the
      specified range.
      
      remove_inode_hugepages handles both hole punch and truncate operations.
      Page index handling was fixed/cleaned up so that the loop index always
      matches the page being processed.  The code now only makes a single pass
      through the range of pages as it was determined page faults could not race
      with truncate.  A cond_resched() was added after removing up to
      PAGEVEC_SIZE pages.
      
      Some totally unnecessary code in hugetlbfs_fallocate() that remained from
      early development was also removed.
      
      Tested with fallocate tests submitted here:
      http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
      And, some ftruncate tests under development
      
      Fixes: b5cec28d ("hugetlbfs: truncate_hugepages() takes a range of pages")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>	[4.3]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1817889e
  24. 09 Sep, 2015 - 3 commits
    • hugetlbfs: add hugetlbfs_fallocate() · 70c3547e
      Authored by Mike Kravetz
      This is based on the shmem version, but it has diverged quite a bit.  We
      have no swap to worry about, nor the new file sealing.  Add
      synchronization via the fault mutex table to coordinate page faults,
      fallocate allocation and fallocate hole punch.
      
      What this allows us to do is move physical memory in and out of a
      hugetlbfs file without having it mapped.  This also gives us the ability
      to support MADV_REMOVE since it is currently implemented using
      fallocate().  MADV_REMOVE lets madvise() remove pages from the middle of
      a hugetlbfs file, which wasn't possible before.
      
      hugetlbfs fallocate only operates on whole huge pages.
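
      A userspace usage sketch (assuming a 2MB huge page size and a hugetlbfs
      mount at /dev/hugepages; both are assumptions, not part of the patch):

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
              const long hpage = 2 * 1024 * 1024;
              int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0600);

              if (fd < 0) {
                      perror("open");
                      return 1;
              }

              /* allocate two huge pages without ever mapping the file */
              if (fallocate(fd, 0, 0, 2 * hpage) < 0)
                      perror("fallocate(alloc)");

              /* punch the first huge page back out; offset and length
               * must be huge-page aligned */
              if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            0, hpage) < 0)
                      perror("fallocate(punch)");

              close(fd);
              return 0;
      }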
      
      Based on code by Dave Hansen.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70c3547e
    • hugetlbfs: truncate_hugepages() takes a range of pages · b5cec28d
      Authored by Mike Kravetz
      Modify truncate_hugepages() to take a range of pages (start, end)
      instead of simply start.  If an end value of LLONG_MAX is passed, the
      current "truncate" functionality is maintained.  Existing callers are
      modified to pass LLONG_MAX as end of range.  By keying off end ==
      LLONG_MAX, the routine behaves differently for truncate and hole punch.
      Page removal is now synchronized with page allocation via faults by
      using the fault mutex table.  The hole punch case can experience the
      rare region_del error and must handle accordingly.
      
      Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
      the case where region_del returns an error.
      
      Since the routine handles more than just the truncate case, it is
      renamed to remove_inode_hugepages().  To be consistent, the routine
      truncate_huge_page() is renamed remove_huge_page().
      
      Downstream of remove_inode_hugepages(), the routine
      hugetlb_unreserve_pages() is also modified to take a range of pages.
      hugetlb_unreserve_pages is modified to detect an error from region_del and
      pass it back to the caller.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5cec28d
    • hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete · 1bfad99a
      Authored by Mike Kravetz
      fallocate hole punch will want to unmap a specific range of pages.
      Modify the existing hugetlb_vmtruncate_list() routine to take a
      start/end range.  If end is 0, this indicates all pages after start
      should be unmapped.  This is the same as the existing truncate
      functionality.  Modify existing callers to add 0 as end of range.
      
      Since the routine will be used in hole punch as well as truncate
      operations, it is more appropriately renamed to hugetlb_vmdelete_list().
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1bfad99a
  25. 07 Aug, 2015 - 1 commit
    • ipc: use private shmem or hugetlbfs inodes for shm segments. · e1832f29
      Authored by Stephen Smalley
      The shm implementation internally uses shmem or hugetlbfs inodes for shm
      segments.  As these inodes are never directly exposed to userspace and
      only accessed through the shm operations which are already hooked by
      security modules, mark the inodes with the S_PRIVATE flag so that inode
      security initialization and permission checking is skipped.
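
      The core of the change is flagging these internal inodes right after
      they are created; sketched:

      #include <linux/fs.h>

      /* never exposed to userspace directly, so skip LSM inode checks */
      static void example_mark_shm_inode(struct inode *inode)
      {
              inode->i_flags |= S_PRIVATE;
      }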
      
      This was motivated by the following lockdep warning:
      
        ======================================================
         [ INFO: possible circular locking dependency detected ]
         4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G        W
        -------------------------------------------------------
         httpd/1597 is trying to acquire lock:
         (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
         but task is already holding lock:
         (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
         which lock already depends on the new lock.
         the existing dependency chain (in reverse order) is:
         -> #3 (&mm->mmap_sem){++++++}:
              lock_acquire+0xc7/0x270
              __might_fault+0x7a/0xa0
              filldir+0x9e/0x130
              xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
              xfs_readdir+0x1b4/0x330 [xfs]
              xfs_file_readdir+0x2b/0x30 [xfs]
              iterate_dir+0x97/0x130
              SyS_getdents+0x91/0x120
              entry_SYSCALL_64_fastpath+0x12/0x76
         -> #2 (&xfs_dir_ilock_class){++++.+}:
              lock_acquire+0xc7/0x270
              down_read_nested+0x57/0xa0
              xfs_ilock+0x167/0x350 [xfs]
              xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
              xfs_attr_get+0xbd/0x190 [xfs]
              xfs_xattr_get+0x3d/0x70 [xfs]
              generic_getxattr+0x4f/0x70
              inode_doinit_with_dentry+0x162/0x670
              sb_finish_set_opts+0xd9/0x230
              selinux_set_mnt_opts+0x35c/0x660
              superblock_doinit+0x77/0xf0
              delayed_superblock_init+0x10/0x20
              iterate_supers+0xb3/0x110
              selinux_complete_init+0x2f/0x40
              security_load_policy+0x103/0x600
              sel_write_load+0xc1/0x750
              __vfs_write+0x37/0x100
              vfs_write+0xa9/0x1a0
              SyS_write+0x58/0xd0
              entry_SYSCALL_64_fastpath+0x12/0x76
        ...
      Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
      Reported-by: Morten Stevens <mstevens@fedoraproject.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1832f29
  26. 25 Jun, 2015 - 1 commit
  27. 16 Apr, 2015 - 1 commit