1. 08 6月, 2018 1 次提交
    • L
      mm: introduce ARCH_HAS_PTE_SPECIAL · 3010a5ea
      Laurent Dufour 提交于
      Currently the PTE special supports is turned on in per architecture
      header files.  Most of the time, it is defined in
      arch/*/include/asm/pgtable.h depending or not on some other per
      architecture static definition.
      
      This patch introduce a new configuration variable to manage this
      directly in the Kconfig files.  It would later replace
      __HAVE_ARCH_PTE_SPECIAL.
      
      Here notes for some architecture where the definition of
      __HAVE_ARCH_PTE_SPECIAL is not obvious:
      
      arm
       __HAVE_ARCH_PTE_SPECIAL which is currently defined in
      arch/arm/include/asm/pgtable-3level.h which is included by
      arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
      So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
      
      powerpc
      __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
       - arch/powerpc/include/asm/book3s/64/pgtable.h
       - arch/powerpc/include/asm/pte-common.h
      The first one is included if (PPC_BOOK3S & PPC64) while the second is
      included in all the other cases.
      So select ARCH_HAS_PTE_SPECIAL all the time.
      
      sparc:
      __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
      defined(__arch64__) which are defined through the compiler in
      sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
      So select ARCH_HAS_PTE_SPECIAL if SPARC64
      
      There is no functional change introduced by this patch.
      
      Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.comSigned-off-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Suggested-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Albert Ou <albert@sifive.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Christophe LEROY <christophe.leroy@c-s.fr>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3010a5ea
  2. 18 5月, 2018 1 次提交
    • W
      proc: do not access cmdline nor environ from file-backed areas · 7f7ccc2c
      Willy Tarreau 提交于
      proc_pid_cmdline_read() and environ_read() directly access the target
      process' VM to retrieve the command line and environment. If this
      process remaps these areas onto a file via mmap(), the requesting
      process may experience various issues such as extra delays if the
      underlying device is slow to respond.
      
      Let's simply refuse to access file-backed areas in these functions.
      For this we add a new FOLL_ANON gup flag that is passed to all calls
      to access_remote_vm(). The code already takes care of such failures
      (including unmapped areas). Accesses via /proc/pid/mem were not
      changed though.
      
      This was assigned CVE-2018-1120.
      
      Note for stable backports: the patch may apply to kernels prior to 4.11
      but silently miss one location; it must be checked that no call to
      access_remote_vm() keeps zero as the last argument.
      Reported-by: NQualys Security Advisory <qsa@qualys.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f7ccc2c
  3. 14 4月, 2018 2 次提交
  4. 06 4月, 2018 1 次提交
  5. 10 3月, 2018 1 次提交
  6. 09 1月, 2018 1 次提交
  7. 16 12月, 2017 1 次提交
    • L
      Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths" · f6f37321
      Linus Torvalds 提交于
      This reverts commits 5c9d2d5c, c7da82b8, and e7fe7b5c.
      
      We'll probably need to revisit this, but basically we should not
      complicate the get_user_pages_fast() case, and checking the actual page
      table protection key bits will require more care anyway, since the
      protection keys depend on the exact state of the VM in question.
      
      Particularly when doing a "remote" page lookup (ie in somebody elses VM,
      not your own), you need to be much more careful than this was.  Dave
      Hansen says:
      
       "So, the underlying bug here is that we now a get_user_pages_remote()
        and then go ahead and do the p*_access_permitted() checks against the
        current PKRU. This was introduced recently with the addition of the
        new p??_access_permitted() calls.
      
        We have checks in the VMA path for the "remote" gups and we avoid
        consulting PKRU for them. This got missed in the pkeys selftests
        because I did a ptrace read, but not a *write*. I also didn't
        explicitly test it against something where a COW needed to be done"
      
      It's also not entirely clear that it makes sense to check the protection
      key bits at this level at all.  But one possible eventual solution is to
      make the get_user_pages_fast() case just abort if it sees protection key
      bits set, which makes us fall back to the regular get_user_pages() case,
      which then has a vma and can do the check there if we want to.
      
      We'll see.
      
      Somewhat related to this all: what we _do_ want to do some day is to
      check the PAGE_USER bit - it should obviously always be set for user
      pages, but it would be a good check to have back.  Because we have no
      generic way to test for it, we lost it as part of moving over from the
      architecture-specific x86 GUP implementation to the generic one in
      commit e585513b ("x86/mm/gup: Switch GUP to the generic
      get_user_page_fast() implementation").
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6f37321
  8. 03 12月, 2017 3 次提交
    • A
      __get_user_pages_locked(): get rid of notify_drop argument · e716712f
      Al Viro 提交于
      The only caller that doesn't pass true in it is get_user_pages() and
      it passes NULL in locked.  The only place where we check it is
      	if (notify_locked && lock_dropped && *locked)
      and lock_dropped can become true only if we have locked != NULL.
      In other words, the second part of condition will be false when
      called by get_user_pages().
      
      Just get rid of the argument and turn the condition into
      	if (lock_dropped && *locked)
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e716712f
    • A
      get_user_pages_unlocked(): pass true to __get_user_pages_locked() notify_drop · 14cb138d
      Al Viro 提交于
      Equivalent transformation - the only place in __get_user_pages_locked()
      where we look at notify_drop argument is
      	if (notify_drop && lock_dropped && *locked) {
      		up_read(&mm->mmap_sem);
      		*locked = 0;
      	}
      in the very end.  Changing notify_drop from false to true won't change
      behaviour unless *locked is non-zero.  The caller is
              ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL,
      			      &locked, false, gup_flags | FOLL_TOUCH);
      	if (locked)
      		up_read(&mm->mmap_sem);
      so in that case the original kernel would have done up_read() right after
      return from __get_user_pages_locked(), while the modified one would've done
      it right before the return.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      14cb138d
    • A
      c803c9c6
  9. 30 11月, 2017 2 次提交
    • D
      mm: introduce get_user_pages_longterm · 2bb6d283
      Dan Williams 提交于
      Patch series "introduce get_user_pages_longterm()", v2.
      
      Here is a new get_user_pages api for cases where a driver intends to
      keep an elevated page count indefinitely.  This is distinct from usages
      like iov_iter_get_pages where the elevated page counts are transient.
      The iov_iter_get_pages cases immediately turn around and submit the
      pages to a device driver which will put_page when the i/o operation
      completes (under kernel control).
      
      In the longterm case userspace is responsible for dropping the page
      reference at some undefined point in the future.  This is untenable for
      filesystem-dax case where the filesystem is in control of the lifetime
      of the block / page and needs reasonable limits on how long it can wait
      for pages in a mapping to become idle.
      
      Fixing filesystems to actually wait for dax pages to be idle before
      blocks from a truncate/hole-punch operation are repurposed is saved for
      a later patch series.
      
      Also, allowing longterm registration of dax mappings is a future patch
      series that introduces a "map with lease" semantic where the kernel can
      revoke a lease and force userspace to drop its page references.
      
      I have also tagged these for -stable to purposely break cases that might
      assume that longterm memory registrations for filesystem-dax mappings
      were supported by the kernel.  The behavior regression this policy
      change implies is one of the reasons we maintain the "dax enabled.
      Warning: EXPERIMENTAL, use at your own risk" notification when mounting
      a filesystem in dax mode.
      
      It is worth noting the device-dax interface does not suffer the same
      constraints since it does not support file space management operations
      like hole-punch.
      
      This patch (of 4):
      
      Until there is a solution to the dma-to-dax vs truncate problem it is
      not safe to allow long standing memory registrations against
      filesytem-dax vmas.  Device-dax vmas do not have this problem and are
      explicitly allowed.
      
      This is temporary until a "memory registration with layout-lease"
      mechanism can be implemented for the affected sub-systems (RDMA and
      V4L2).
      
      [akpm@linux-foundation.org: use kcalloc()]
      Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Inki Dae <inki.dae@samsung.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Joonyoung Shim <jy0922.shim@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2bb6d283
    • D
      mm: replace pte_write with pte_access_permitted in fault + gup paths · 5c9d2d5c
      Dan Williams 提交于
      The 'access_permitted' helper is used in the gup-fast path and goes
      beyond the simple _PAGE_RW check to also:
      
       - validate that the mapping is writable from a protection keys
         standpoint
      
       - validate that the pte has _PAGE_USER set since all fault paths where
         pte_write is must be referencing user-memory.
      
      Link: http://lkml.kernel.org/r/151043111604.2842.8051684481794973100.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c9d2d5c
  10. 13 9月, 2017 1 次提交
  11. 09 9月, 2017 2 次提交
    • J
      mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse 提交于
      Platform with advance system bus (like CAPI or CCIX) allow device memory
      to be accessible from CPU in a cache coherent fashion.  Add a new type of
      ZONE_DEVICE to represent such memory.  The use case are the same as for
      the un-addressable device memory but without all the corners cases.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df6ad698
    • Z
      mm: thp: check pmd migration entry in common path · 84c3fc4e
      Zi Yan 提交于
      When THP migration is being used, memory management code needs to handle
      pmd migration entries properly.  This patch uses !pmd_present() or
      is_swap_pmd() (depending on whether pmd_none() needs separate code or
      not) to check pmd migration entries at the places where a pmd entry is
      present.
      
      Since pmd-related code uses split_huge_page(), split_huge_pmd(),
      pmd_trans_huge(), pmd_trans_unstable(), or
      pmd_none_or_trans_huge_or_clear_bad(), this patch:
      
      1. adds pmd migration entry split code in split_huge_pmd(),
      
      2. takes care of pmd migration entries whenever pmd_trans_huge() is present,
      
      3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
      
      Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
      is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
      them.
      
      Until this commit, a pmd entry should be:
      1. pointing to a pte page,
      2. is_swap_pmd(),
      3. pmd_trans_huge(),
      4. pmd_devmap(), or
      5. pmd_none().
      Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84c3fc4e
  12. 07 9月, 2017 1 次提交
  13. 07 7月, 2017 5 次提交
  14. 19 6月, 2017 1 次提交
    • H
      mm: larger stack guard gap, between vmas · 1be7107f
      Hugh Dickins 提交于
      Stack guard page is a useful feature to reduce a risk of stack smashing
      into a different mapping. We have been using a single page gap which
      is sufficient to prevent having stack adjacent to a different mapping.
      But this seems to be insufficient in the light of the stack usage in
      userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
      used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
      which is 256kB or stack strings with MAX_ARG_STRLEN.
      
      This will become especially dangerous for suid binaries and the default
      no limit for the stack size limit because those applications can be
      tricked to consume a large portion of the stack and a single glibc call
      could jump over the guard page. These attacks are not theoretical,
      unfortunatelly.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somehow inherent, but it should reduce attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and and the vma tree's subtree_gap support for that.
      Original-patch-by: NOleg Nesterov <oleg@redhat.com>
      Original-patch-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1be7107f
  15. 13 6月, 2017 1 次提交
  16. 03 6月, 2017 1 次提交
  17. 04 5月, 2017 1 次提交
  18. 23 4月, 2017 1 次提交
    • I
      Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" · 6dd29b3d
      Ingo Molnar 提交于
      This reverts commit 2947ba05.
      
      Dan Williams reported dax-pmem kernel warnings with the following signature:
      
         WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
         percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
      
      ... and bisected it to this commit, which suggests possible memory corruption
      caused by the x86 fast-GUP conversion.
      
      He also pointed out:
      
       "
        This is similar to the backtrace when we were not properly handling
        pud faults and was fixed with this commit: 220ced16 "mm: fix
        get_user_pages() vs device-dax pud mappings"
      
        I've found some missing _devmap checks in the generic
        get_user_pages_fast() path, but this does not fix the regression
        [...]
       "
      
      So given that there are known bugs, and a pretty robust looking bisection
      points to this commit suggesting that are unknown bugs in the conversion
      as well, revert it for the time being - we'll re-try in v4.13.
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dann.frazier@canonical.com
      Cc: dave.hansen@intel.com
      Cc: steve.capper@linaro.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6dd29b3d
  19. 18 3月, 2017 7 次提交
  20. 13 3月, 2017 1 次提交
  21. 10 3月, 2017 1 次提交
  22. 02 3月, 2017 1 次提交
  23. 25 2月, 2017 2 次提交
  24. 23 2月, 2017 1 次提交