1. 08 Aug 2020 (2 commits)
    • tmpfs: support 64-bit inums per-sb · ea3271f7
      Authored by Chris Down
      The default is still set to inode32 for backwards compatibility, but
      system administrators can opt in to the new 64-bit inode numbers by
      either:
      
      1. Passing inode64 on the command line when mounting, or
      2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
      
      The inode64 and inode32 names follow existing precedent from XFS.
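
      As an illustration of option 1, a program could opt in at mount time
      directly via mount(2).  A minimal sketch, where /mnt/tmp is a
      hypothetical pre-existing mount point:

      #include <stdio.h>
      #include <sys/mount.h>

      int main(void)
      {
              /* Request 64-bit inode numbers for this tmpfs instance. */
              if (mount("tmpfs", "/mnt/tmp", "tmpfs", 0, "inode64") != 0) {
                      perror("mount");  /* e.g. fails where ino_t is 32-bit */
                      return 1;
              }
              return 0;
      }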
      
      [hughd@google.com: Kconfig fixes]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: per-superblock i_ino support · e809d5f0
      Authored by Chris Down
      Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.
      
      In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.  On
      affected tiers, in excess of 10% of hosts show multiple files with
      different content and the same inode number, with some servers even having
      as many as 150 duplicated inode numbers with differing file content.
      
      This causes actual, tangible problems in production.  For example, we
      have complaints from those working on remote caches that their
      application reports cache corruption because it uses (device, inodenum)
      to establish the identity of a particular cache object; once that pair
      is no longer unique, the application refuses to continue.  Even worse,
      sometimes applications may not even detect the corruption but may
      continue anyway, causing phantom and hard-to-debug behaviour.
      
      In general, userspace applications expect that (device, inodenum)
      should be enough to uniquely identify one inode, which seems fair
      enough.  One
      might also need to check the generation, but in this case:
      
      1. That's not currently exposed to userspace
         (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
      2. Even with generation, there shouldn't be two live inodes with the
         same inode number on one device.
      
      In order to mitigate this, we take a two-pronged approach:
      
      1. Moving inum generation from being global to per-sb for tmpfs. This
         itself allows some reduction in i_ino churn. This works on both 64-
         and 32-bit machines.
      2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
         64-bit ino_t only: we allow users to mount tmpfs with a new inode64
         option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.
      
      You can see how this compares to previous related patches which didn't
      implement this per-superblock:
      
      - https://patchwork.kernel.org/patch/11254001/
      - https://patchwork.kernel.org/patch/11023915/
      
      This patch (of 2):
      
      get_next_ino has a number of problems:
      
      - It uses and returns a uint, which is liable to overflow if a lot of
        volatile inodes that use get_next_ino are created.
      - It's global, with no specificity per-sb or even per-filesystem. This
        means it's not that difficult to cause inode number wraparounds on a
        single device, which can result in having multiple distinct inodes
        with the same inode number.
      
      This patch adds a per-superblock counter that mitigates the second case.
      This design also allows us to later have a specific i_ino size per-device,
      for example, allowing users to choose whether to use 32- or 64-bit inodes
      for each tmpfs mount.  This is implemented in the next commit.
      
      For internal shmem mounts which may be less tolerant to spinlock delays,
      we implement a percpu batching scheme which only takes the stat_lock at
      each batch boundary.
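
      A minimal sketch of that batching idea (simplified; the batch size and
      the field names below are illustrative assumptions, not the exact
      kernel code):

      #include <linux/fs.h>
      #include <linux/percpu.h>
      #include <linux/spinlock.h>

      #define SHMEM_INO_BATCH 1024ULL  /* batch size: an assumption */

      struct shmem_sb_info {
              spinlock_t stat_lock;              /* protects next_ino */
              unsigned long next_ino;            /* per-sb counter */
              unsigned long __percpu *ino_batch; /* per-cpu next free inum */
      };

      static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
      {
              struct shmem_sb_info *sbinfo = sb->s_fs_info;
              unsigned long *next_ino;
              ino_t ino;

              next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
              ino = *next_ino;
              if (unlikely(ino % SHMEM_INO_BATCH == 0)) {
                      /* Batch exhausted: take the lock once per batch. */
                      spin_lock(&sbinfo->stat_lock);
                      ino = sbinfo->next_ino;
                      sbinfo->next_ino += SHMEM_INO_BATCH;
                      spin_unlock(&sbinfo->stat_lock);
              }
              *next_ino = ino + 1;
              put_cpu();
              *inop = ino;
              return 0;
      }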
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
      Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 25 Jul 2020 (1 commit)
  3. 10 Jun 2020 (2 commits)
    • mmap locking API: convert mmap_sem comments · c1e8d7c6
      Authored by Michel Lespinasse
      Convert comments that reference mmap_sem to reference mmap_lock instead.
      
      [akpm@linux-foundation.org: fix up linux-next leftovers]
      [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
      [akpm@linux-foundation.org: more linux-next fixups, per Michel]
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ying Han <yinghan@google.com>
      Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4
      Authored by Mike Rapoport
      Patch series "mm: consolidate definitions of page table accessors", v2.
      
      The low-level page table accessors (pXY_index(), pXY_offset()) are
      duplicated across all architectures and sometimes more than once.  For
      instance, we have 31 definitions of pgd_offset() for 25 supported
      architectures.
      
      Most of these definitions are actually identical and typically it boils
      down to, e.g.
      
      /* Index within the PMD table of the entry covering @address. */
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }

      /* Pointer to the PMD entry covering @address. */
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      
      These definitions can be shared among 90% of the arches provided
      XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
      
      For architectures that really need a custom version there is always the
      possibility to override the generic version with the usual ifdef magic,
      as sketched below.
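
      As an illustration, the override pattern might look like the following
      sketch (the guard-macro idiom shown is a common kernel convention;
      treat the exact form as an assumption, not a quote from these patches):

      /* include/linux/pgtable.h: generic version, compiled only when the
       * architecture has not already provided its own pmd_index(). */
      #ifndef pmd_index
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      #define pmd_index pmd_index
      #endif

      /* arch/<arch>/include/asm/pgtable.h: an arch needing a custom version
       * defines pmd_index (both the function and the macro) before the
       * generic header is included, so the #ifndef above skips it. */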
      
      These patches introduce include/linux/pgtable.h that replaces
      include/asm-generic/pgtable.h and add the definitions of the page table
      accessors to the new header.
      
      This patch (of 12):
      
      The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
      functions involving page table manipulations, e.g.  pte_alloc() and
      pmd_alloc().  So, there is no point in explicitly including
      <asm/pgtable.h> in the files that include <linux/mm.h>.
      
      The include statements in such cases are removed with a simple loop:
      
      	for f in $(git grep -l "include <linux/mm.h>") ; do
      		sed -i -e '/include <asm\/pgtable.h>/ d' $f
      	done
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 04 Jun 2020 (7 commits)
  5. 22 Apr 2020 (3 commits)
  6. 08 Apr 2020 (7 commits)
  7. 17 Mar 2020 (1 commit)
  8. 19 Feb 2020 (1 commit)
  9. 08 Feb 2020 (3 commits)
  10. 07 Feb 2020 (2 commits)
  11. 14 Jan 2020 (1 commit)
    • mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment · 99158997
      Authored by Kirill A. Shutemov
      Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
      enabled.  But it doesn't work well with above-47-bit hint addresses.
      
      Normally, the kernel doesn't create userspace mappings above 47-bit,
      even if the machine allows this (such as with 5-level paging on x86-64).
      Not all user space is ready to handle wide addresses.  It's known that
      at least some JIT compilers use higher bits in pointers to encode their
      information.
      
      Userspace can ask for allocation from the full address space by
      specifying a hint address (with or without MAP_FIXED) above 47 bits.
      If the application doesn't need a particular address, but wants to
      allocate from the whole address space, it can specify -1 as a hint
      address, as in the sketch below.
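
      A minimal userspace sketch of that (the mapping size and flags are
      arbitrary illustrative choices):

      #include <stdio.h>
      #include <sys/mman.h>

      int main(void)
      {
              /* A hint above 47 bits (here -1, the highest address) tells
               * the kernel we can handle the full virtual address space. */
              void *p = mmap((void *)-1UL, 2UL << 20, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (p == MAP_FAILED) {
                      perror("mmap");
                      return 1;
              }
              /* On x86-64 with 5-level paging, p may lie above 47 bits. */
              printf("mapped at %p\n", p);
              return 0;
      }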
      
      Unfortunately, this trick breaks THP alignment in shmem/tmpfs:
      shmem_get_unmapped_area() would not try to allocate a PMD-aligned area
      if *any* hint address is specified.
      
      This can be fixed by requesting the aligned area if we failed to
      allocate at the user-specified hint address.  The request with inflated
      length will also take the user-specified hint address.  This way we
      will not lose an allocation request from the full address space.
      
      [kirill@shutemov.name: fold in a fixup]
        Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
      Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
      Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Willhalm, Thomas" <thomas.willhalm@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 02 Dec 2019 (4 commits)
  13. 01 Dec 2019 (1 commit)
    • shmem: pin the file in shmem_fault() if mmap_sem is dropped · 8897c1b1
      Authored by Kirill A. Shutemov
      syzbot found the following crash:
      
        BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
        Read of size 8 at addr ffff8880a5cf2c50 by task syz-executor.0/26173
      
        CPU: 0 PID: 26173 Comm: syz-executor.0 Not tainted 5.3.0-rc6 #146
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
           perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
           trace_lock_acquire include/trace/events/lock.h:13 [inline]
           lock_acquire+0x2de/0x410 kernel/locking/lockdep.c:4411
           __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
           _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
           spin_lock include/linux/spinlock.h:338 [inline]
           shmem_fault+0x5ec/0x7b0 mm/shmem.c:2034
           __do_fault+0x111/0x540 mm/memory.c:3083
           do_shared_fault mm/memory.c:3535 [inline]
           do_fault mm/memory.c:3613 [inline]
           handle_pte_fault mm/memory.c:3840 [inline]
           __handle_mm_fault+0x2adf/0x3f20 mm/memory.c:3964
           handle_mm_fault+0x1b5/0x6b0 mm/memory.c:4001
           do_user_addr_fault arch/x86/mm/fault.c:1441 [inline]
           __do_page_fault+0x536/0xdd0 arch/x86/mm/fault.c:1506
           do_page_fault+0x38/0x590 arch/x86/mm/fault.c:1530
           page_fault+0x39/0x40 arch/x86/entry/entry_64.S:1202
      
      It happens if the VMA got unmapped under us while we dropped mmap_sem
      and the inode got freed.
      
      Pinning the file if we drop mmap_sem fixes the issue.
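
      The shape of the fix, as a hedged sketch (not the literal patch;
      shmem_fault_sketch and its surrounding details are illustrative):

      #include <linux/file.h>
      #include <linux/mm.h>

      /* Pin the mapped file before releasing mmap_sem, so the inode cannot
       * be freed even if the VMA is unmapped concurrently. */
      static vm_fault_t shmem_fault_sketch(struct vm_fault *vmf)
      {
              struct file *fpin = get_file(vmf->vma->vm_file);  /* pin */

              up_read(&vmf->vma->vm_mm->mmap_sem); /* VMA may now go away */
              /* ... wait for the in-progress hole-punch to finish ... */
              fput(fpin);                          /* drop our reference */
              return VM_FAULT_RETRY;
      }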
      
      Link: http://lkml.kernel.org/r/20190927083908.rhifa4mmaxefc24r@box
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: syzbot+03ee87124ee05af991bd@syzkaller.appspotmail.com
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 10 Oct 2019 (1 commit)
  15. 29 Sep 2019 (1 commit)
    • Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 19deb769
      Authored by David Rientjes
      This reverts commit 92717d42.
      
      Since commit a8282608 ("Revert "mm, thp: restore node-local hugepage
      allocations"") is reverted in this series, it is better to restore the
      previous 5.2 behavior between the thp allocation and the page allocator
      rather than to attempt any consolidation or cleanup for a policy that is
      now reverted.  It's less risky during an rc cycle and subsequent patches
      in this series further modify the same policy that the pre-5.3 behavior
      implements.
      
      Consolidation and cleanup can be done subsequent to a sane default page
      allocation strategy, so this patch reverts a cleanup done on a strategy
      that is now reverted and thus is the least risky option.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 25 Sep 2019 (3 commits)