1. 07 April 2016, 1 commit
• mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs · c12d2da5
Committed by Ingo Molnar
      The pkeys changes brought about a truly hideous set of macros in:
      
        cde70140 ("mm/gup: Overload get_user_pages() functions")
      
      ... which macros are (ab-)using the fact that __VA_ARGS__ can be used
      to shift parameter positions in macro arguments without breaking the
      build and so can be used to call separate C functions depending on
      the number of arguments of the macro.
      
      This allowed easy migration of these 3 GUP APIs, as both these variants
      worked at the C level:
      
        old:
      	ret = get_user_pages(current, current->mm, address, 1, 1, 0, &page, NULL);
      
        new:
      	ret = get_user_pages(address, 1, 1, 0, &page, NULL);
      
      ... while we also generated a (functionally harmless but noticeable) build
      time warning if the old API was used. As there are over 300 uses of these
      APIs, this trick eased the migration of the API and avoided excessive
      migration pain in linux-next.
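
As an illustration of the trick (a simplified sketch; the real helpers in
cde70140 differ in detail), the selector macro always expands to its 9th
argument, and padding the call with candidate function names lets the
caller's argument count decide which name lands in that slot:

	/* GUP_SELECT() picks its 9th argument.  With 6 call arguments the
	 * padding shifts get_user_pages6 into that slot; with the old
	 * 8-argument style, get_user_pages8 lands there instead. */
	#define GUP_SELECT(_1, _2, _3, _4, _5, _6, _7, _8, NAME, ...) NAME
	#define get_user_pages(...)					\
		GUP_SELECT(__VA_ARGS__,					\
			   get_user_pages8, x,				\
			   get_user_pages6, x, x, x, x)(__VA_ARGS__)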
      
      Now, with its work done, get rid of all of that complication and ugliness:
      
          3 files changed, 16 insertions(+), 140 deletions(-)
      
      ... where the linecount of the migration hack was further inflated by the
      fact that there are NOMMU variants of these GUP APIs as well.
      
Much of the conversion was done in linux-next over the past couple of months,
and Linus recently removed all remaining old API uses from the upstream tree
in the following upstream commit:
      
        cb107161 ("Convert straggling drivers to new six-argument get_user_pages()")
      
There was one more old-API usage in mm/gup.c, in the CONFIG_HAVE_GENERIC_RCU_GUP
code path that ARM, ARM64 and PowerPC use.
      
      After this commit any old API usage will break the build.
      
      [ Also fixed a PowerPC/HAVE_GENERIC_RCU_GUP warning reported by Stephen Rothwell. ]
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2. 26 March 2016, 1 commit
• mm, oom: introduce oom reaper · aac45363
Committed by Michal Hocko
      This patch (of 5):
      
      This is based on the idea from Mel Gorman discussed during LSFMM 2015
      and independently brought up by Oleg Nesterov.
      
The OOM killer currently allows killing only a single task, in the hope
that the task will terminate in a reasonable time and free up its
memory.  Such a task (the oom victim) will get access to memory reserves
via mark_oom_victim to allow forward progress should there be a need
for additional memory during the exit path.
      
It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above, and
the OOM victim might take an unbounded amount of time to exit because it
might be blocked in the uninterruptible state, waiting for an event (e.g.
a lock) which is blocked by another task looping in the page allocator.
      
This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped-out memory owned
by the oom victim, under the assumption that such memory won't be needed
when its owner is killed and kicked from userspace anyway.  There is
one notable exception to this, though: if the OOM victim was in the
process of coredumping, the result would be incomplete.  This is
considered a reasonable constraint, because overall system health is
more important than the debuggability of a particular application.
      
A kernel thread has been chosen because we need a reliable way of
invocation; a workqueue context is not appropriate, because all the
workers might be busy (e.g. allocating memory).  Kswapd, which sounds
like another good fit, is not appropriate either, because it might get
blocked on locks during reclaim.
      
oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable, because the semaphore might be held or
blocked for write, but the probability is reduced considerably compared
to basically any lock blocking forward progress, as described above.  In
order to prevent blocking on the lock without any forward progress, we
use only a trylock and retry 10 times with a short sleep in between.
Users of
      mmap_sem which need it for write should be carefully reviewed to use
      _killable waiting as much as possible and reduce allocations requests
      done with the lock held to absolute minimum to reduce the risk even
      further.
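
A minimal sketch of that retry loop (function names from the patch text,
details simplified):

	/* Never block on mmap_sem: a failed trylock just means "retry". */
	static bool __oom_reap_vmas(struct mm_struct *mm)
	{
		if (!down_read_trylock(&mm->mmap_sem))
			return false;	/* likely held for write, back off */
		/* ... reap anonymous/swapped-out memory here ... */
		up_read(&mm->mmap_sem);
		return true;
	}

	static void oom_reap_vmas(struct mm_struct *mm)
	{
		int attempts = 0;

		/* a trylock plus up to 10 retries with a short sleep */
		while (attempts++ < 10 && !__oom_reap_vmas(mm))
			msleep(100);
	}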
      
      The API between oom killer and oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg, to guarantee only a
NULL->mm transition, and oom_reaper clears it atomically once it is done
with the work.  This means that only a single mm_struct can be reaped at
a time.  As the operation is potentially disruptive, we try to
limit it to the necessary minimum, and the reaper blocks any updates while
      it operates on an mm.  mm_struct is pinned by mm_count to allow parallel
      exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).
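
A hedged sketch of that handoff (the waitqueue name and details are
assumptions):

	static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
	static struct mm_struct *mm_to_reap;

	static void wake_oom_reaper(struct mm_struct *mm)
	{
		/* pin the mm so a parallel exit_mmap() cannot free it */
		atomic_inc(&mm->mm_count);
		/* only a NULL->mm transition queues work for the reaper */
		if (cmpxchg(&mm_to_reap, NULL, mm) != NULL)
			mmdrop(mm);	/* another mm is already queued */
		else
			wake_up(&oom_reaper_wait);
	}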
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <andrea@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3. 18 March 2016, 7 commits
4. 16 March 2016, 5 commits
• mm: simplify lock_page_memcg() · 62cccb8c
Committed by Johannes Weiner
      Now that migration doesn't clear page->mem_cgroup of live pages anymore,
      it's safe to make lock_page_memcg() and the memcg stat functions take
      pages, and spare the callers from memcg objects.
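
Illustratively, a caller now passes the page straight through (the stat
index name is assumed for the example):

	/* no more memcg pointer to thread through the call chain */
	lock_page_memcg(page);
	if (TestClearPageDirty(page))
		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
	unlock_page_memcg(page);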
      
      [akpm@linux-foundation.org: fix warnings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: migrate: do not touch page->mem_cgroup of live pages · 6a93ca8f
Committed by Johannes Weiner
      Changing a page's memcg association complicates dealing with the page,
      so we want to limit this as much as possible.  Page migration e.g.  does
      not have to do that.  Just like page cache replacement, it can forcibly
      charge a replacement page, and then uncharge the old page when it gets
      freed.  Temporarily overcharging the cgroup by a single page is not an
      issue in practice, and charging is so cheap nowadays that this is much
preferable to the headache of messing with live pages.
      
      The only place that still changes the page->mem_cgroup binding of live
      pages is when pages move along with a task to another cgroup.  But that
      path isolates the page from the LRU, takes the page lock, and the move
      lock (lock_page_memcg()).  That means page->mem_cgroup is always stable
      in callers that have the page isolated from the LRU or locked.  Lighter
      unlocked paths, like writeback accounting, can use lock_page_memcg().
      
      [akpm@linux-foundation.org: fix build]
      [vdavydov@virtuozzo.com: fix lockdep splat]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/page_poisoning.c: allow for zero poisoning · 1414c7f4
Committed by Laura Abbott
      By default, page poisoning uses a poison value (0xaa) on free.  If this
      is changed to 0, the page is not only sanitized but zeroing on alloc
with __GFP_ZERO can be skipped as well.  The tradeoff is that corruption
from the poisoning is harder to detect.  This feature also
      cannot be used with hibernation since pages are not guaranteed to be
      zeroed after hibernation.
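
A hedged sketch of the alloc-side shortcut this enables (the helper name
is illustrative):

	/* With a zero poison value, a page that was poisoned on free is
	 * already zeroed, so __GFP_ZERO clearing can be skipped. */
	static inline bool free_pages_prezeroed(bool poisoned)
	{
		return IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) &&
		       page_poisoning_enabled() && poisoned;
	}

The allocator can then test this before running its __GFP_ZERO
clear_highpage() loop.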
      
Credit to the Grsecurity/PaX team for inspiring this work.
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/page_poison.c: enable PAGE_POISONING as a separate option · 8823b1db
Committed by Laura Abbott
Page poisoning is currently set up as a feature for architectures that
don't have architecture debug page_alloc to allow unmapping of pages.  It
has uses apart from that, though: clearing pages on free provides an
increase in security, as it helps to limit the risk of information leaks.
      Allow page poisoning to be enabled as a separate option independent of
      kernel_map pages since the two features do separate work.  Because of
how hibernation is implemented, the checks on alloc cannot occur if
      hibernation is enabled.  The runtime alloc checks can also be enabled
      with an option when !HIBERNATION.
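
A hedged sketch of the decoupled switch (the boot-parameter wiring here
is an assumption):

	/* poisoning now has its own knob, independent of DEBUG_PAGEALLOC */
	static bool want_page_poisoning __read_mostly;

	static int __init early_page_poison_param(char *buf)
	{
		return strtobool(buf, &want_page_poisoning);
	}
	early_param("page_poison", early_page_poison_param);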
      
Credit to the Grsecurity/PaX team for inspiring this work.
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/slab: clean up DEBUG_PAGEALLOC processing code · 40b44137
Committed by Joonsoo Kim
Currently, open-coded checks for a DEBUG_PAGEALLOC cache are spread
across several sites.  This makes the code unreadable and hard to change.

This patch cleans that code up.  The following patch will change the
criteria for a DEBUG_PAGEALLOC cache, so this clean-up will help it, too.
      
      [akpm@linux-foundation.org: fix build with CONFIG_DEBUG_PAGEALLOC=n]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5. 19 February 2016, 2 commits
• mm/core, x86/mm/pkeys: Differentiate instruction fetches · d61172b4
Committed by Dave Hansen
      As discussed earlier, we attempt to enforce protection keys in
      software.
      
      However, the code checks all faults to ensure that they are not
      violating protection key permissions.  It was assumed that all
      faults are either write faults where we check PKRU[key].WD (write
      disable) or read faults where we check the AD (access disable)
      bit.
      
But, there is a third category of faults for protection keys:
instruction faults.  Instruction faults never run afoul of
protection keys, because protection keys do not affect instruction
fetches.
      
      So, plumb the PF_INSTR bit down in to the
      arch_vma_access_permitted() function where we do the protection
      key checks.
      
      We also add a new FAULT_FLAG_INSTRUCTION.  This is because
      handle_mm_fault() is not passed the architecture-specific
      error_code where we keep PF_INSTR, so we need to encode the
      instruction fetch information in to the arch-generic fault
      flags.
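
A hedged sketch of that plumbing on the x86 side (the helper is
hypothetical; the real patch open-codes this in the fault path):

	/* translate the x86 error code into arch-generic fault flags */
	static unsigned int fault_flags(unsigned long error_code)
	{
		unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

		if (error_code & PF_WRITE)
			flags |= FAULT_FLAG_WRITE;
		if (error_code & PF_INSTR)
			flags |= FAULT_FLAG_INSTRUCTION;
		return flags;
	}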
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• mm/core: Do not enforce PKEY permissions on remote mm access · 1b2ee126
Committed by Dave Hansen
      We try to enforce protection keys in software the same way that we
      do in hardware.  (See long example below).
      
      But, we only want to do this when accessing our *own* process's
      memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
      tried to PTRACE_POKE a target process which just happened to have
      some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
      debugger access to that memory.  PKRU is fundamentally a
      thread-local structure and we do not want to enforce it on access
      to _another_ thread's data.
      
      This gets especially tricky when we have workqueues or other
      delayed-work mechanisms that might run in a random process's context.
      We can check that we only enforce pkeys when operating on our *own* mm,
      but delayed work gets performed when a random user context is active.
      We might end up with a situation where a delayed-work gup fails when
      running randomly under its "own" task but succeeds when running under
      another process.  We want to avoid that.
      
      To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
      fault flag: FAULT_FLAG_REMOTE.  They indicate that we are
walking an mm which is not guaranteed to be the same as
      current->mm and should not be subject to protection key
      enforcement.
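
A hedged sketch of how the generic fault path consumes these flags
(simplified):

	/* near the top of handle_mm_fault(): pkey enforcement is skipped
	 * when the "foreign" argument says this is a remote access */
	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
				       flags & FAULT_FLAG_INSTRUCTION,
				       flags & FAULT_FLAG_REMOTE))
		return VM_FAULT_SIGSEGV;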
      
      Thanks to Jerome Glisse for pointing out this scenario.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
6. 18 February 2016, 2 commits
• x86/mm/pkeys: Add arch-specific VMA protection bits · 8f62c883
Committed by Dave Hansen
      Lots of things seem to do:
      
              vma->vm_page_prot = vm_get_page_prot(flags);
      
      and the ptes get created right from things we pull out
      of ->vm_page_prot.  So it is very convenient if we can
      store the protection key in flags and vm_page_prot, just
      like the existing permission bits (_PAGE_RW/PRESENT).  It
      greatly reduces the amount of plumbing and arch-specific
      hacking we have to do in generic code.
      
      This also takes the new PROT_PKEY{0,1,2,3} flags and
      turns *those* in to VM_ flags for vma->vm_flags.
      
      The protection key values are stored in 4 places:
      	1. "prot" argument to system calls
      	2. vma->vm_flags, filled from the mmap "prot"
	3. vma->vm_page_prot, filled from vma->vm_flags
      	4. the PTE itself.
      
The pseudocode for these four steps is as follows:
      
      	mmap(PROT_PKEY*)
      	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
      	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
      	pte = pfn | vma->vm_page_prot
      
Note that this provides a new definition for x86:
      
      	arch_vm_get_page_prot()
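
A sketch of that definition, modulo exact macro spellings, which maps the
VM_PKEY bits straight into PTE protection bits:

	#define arch_vm_get_page_prot(vm_flags)	__pgprot(		\
		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))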
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210210.FE483A42@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• mm/core, x86/mm/pkeys: Store protection bits in high VMA flags · 63c17fb8
Committed by Dave Hansen
      vma->vm_flags is an 'unsigned long', so has space for 32 flags
      on 32-bit architectures.  The high 32 bits are unused on 64-bit
      platforms.  We've steered away from using the unused high VMA
      bits for things because we would have difficulty supporting it
      on 32-bit.
      
      Protection Keys are not available in 32-bit mode, so there is
      no concern about supporting this feature in 32-bit mode or on
      32-bit CPUs.
      
This patch carves out 4 bits from the high half of
vma->vm_flags and allows architectures to set a config option
to make them available.
      
      Sparse complains about these constants unless we explicitly
      call them "UL".
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Valentin Rothberg <valentinrothberg@gmail.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210208.81AF00D5@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7. 16 February 2016, 2 commits
• mm/gup: Overload get_user_pages() functions · cde70140
Committed by Dave Hansen
      The concept here was a suggestion from Ingo.  The implementation
      horrors are all mine.
      
      This allows get_user_pages(), get_user_pages_unlocked(), and
      get_user_pages_locked() to be called with or without the
      leading tsk/mm arguments.  We will give a compile-time warning
      about the old style being __deprecated and we will also
      WARN_ON() if the non-remote version is used for a remote-style
      access.
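
A hedged sketch of the old-style entry point (names and the exact WARN
condition are assumptions; the header marks the old-style declarations
__deprecated):

	long get_user_pages8(struct task_struct *tsk, struct mm_struct *mm,
			     unsigned long start, unsigned long nr_pages,
			     int write, int force, struct page **pages,
			     struct vm_area_struct **vmas)
	{
		/* scream if the non-remote API does a remote-style access */
		WARN_ON_ONCE(tsk != current || mm != current->mm);
		return get_user_pages6(start, nr_pages, write, force,
				       pages, vmas);
	}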
      
      Doing this, folks will get nice warnings and will not break the
      build.  This should be nice for -next and will hopefully let
      developers fix up their own code instead of maintainers needing
      to do it at merge time.
      
      The way we do this is hideous.  It uses the __VA_ARGS__ macro
      functionality to call different functions based on the number
      of arguments passed to the macro.
      
      There's an additional hack to ensure that our EXPORT_SYMBOL()
      of the deprecated symbols doesn't trigger a warning.
      
      We should be able to remove this mess as soon as -rc1 hits in
      the release after this is merged.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Leon Romanovsky <leon@leon.nu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• mm/gup: Introduce get_user_pages_remote() · 1e987790
Committed by Dave Hansen
      For protection keys, we need to understand whether protections
      should be enforced in software or not.  In general, we enforce
      protections when working on our own task, but not when on others.
      We call these "current" and "remote" operations.
      
      This patch introduces a new get_user_pages() variant:
      
              get_user_pages_remote()
      
which is a replacement for cases where get_user_pages() is called on a
non-current tsk/mm.
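
Its prototype mirrors the old-style calls (per the GUP API of the time):

	long get_user_pages_remote(struct task_struct *tsk,
				   struct mm_struct *mm,
				   unsigned long start, unsigned long nr_pages,
				   int write, int force, struct page **pages,
				   struct vm_area_struct **vmas);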
      
      We also introduce a new gup flag: FOLL_REMOTE which can be used
      for the "__" gup variants to get this new behavior.
      
      The uprobes is_trap_at_addr() location holds mmap_sem and
      calls get_user_pages(current->mm) on an instruction address.  This
      makes it a pretty unique gup caller.  Being an instruction access
      and also really originating from the kernel (vs. the app), I opted
      to consider this a 'remote' access where protection keys will not
      be enforced.
      
      Without protection keys, this patch should not change any behavior.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: jack@suse.cz
      Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
8. 04 February 2016, 2 commits
• mm: polish virtual memory accounting · 30bdbb78
Committed by Konstantin Khlebnikov
* add VM_STACK as an alias for VM_GROWSUP/DOWN, depending on architecture (sketched below)
      * always account VMAs with flag VM_STACK as stack (as it was before)
      * cleanup classifying helpers
      * update comments and documentation
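
The alias itself is a one-ifdef sketch:

	#ifdef CONFIG_STACK_GROWSUP
	#define VM_STACK	VM_GROWSUP
	#else
	#define VM_STACK	VM_GROWSDOWN
	#endif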
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• proc: revert /proc/<pid>/maps [stack:TID] annotation · 65376df5
Committed by Johannes Weiner
      Commit b7643757 ("procfs: mark thread stack correctly in
      proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.
      
      Finding the task of a stack VMA requires walking the entire thread list,
      turning this into quadratic behavior: a thousand threads means a
      thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
      million combinations.
      
      The cost is not in proportion to the usefulness as described in the
      patch.
      
      Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
      /proc/<pid>/numa_maps) usable again for higher thread counts.
      
      The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
      identifying the stack VMA there is an O(1) operation.
      
Siddhesh said:
       "The end users needed a way to identify thread stacks programmatically and
        there wasn't a way to do that.  I'm afraid I no longer remember (or have
        access to the resources that would aid my memory since I changed
        employers) the details of their requirement.  However, I did do this on my
        own time because I thought it was an interesting project for me and nobody
        really gave any feedback then as to its utility, so as far as I am
        concerned you could roll back the main thread maps information since the
        information is available in the thread-specific files"
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9. 30 January 2016, 1 commit
• memremap: Change region_intersects() to take @flags and @desc · 1c29f25b
Committed by Toshi Kani
      Change region_intersects() to identify a target with @flags and
      @desc, instead of @name with strcmp().
      
      Change the callers of region_intersects(), memremap() and
      devm_memremap(), to set IORESOURCE_SYSTEM_RAM in @flags and
      IORES_DESC_NONE in @desc when searching System RAM.
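
A hedged sketch of an updated call site, per the description above:

	/* "Is this whole range System RAM?"  No more strcmp() on names. */
	if (region_intersects(offset, size, IORESOURCE_SYSTEM_RAM,
			      IORES_DESC_NONE) != REGION_INTERSECTS)
		return NULL;	/* treat as device memory, not RAM */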
      
      Also, export region_intersects() so that the ACPI EINJ error
      injection driver can call this function in a later patch.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jakub Sitnicki <jsitnicki@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/1453841853-11383-13-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
10. 16 January 2016, 14 commits
11. 15 January 2016, 3 commits
• mm: rework virtual memory accounting · 84638335
Committed by Konstantin Khlebnikov
While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
(which tests the RLIMIT_DATA value to figure out whether we're allowed to
assign new @start_brk, @brk, @start_data and @end_data in mm_struct), it
became clear that RLIMIT_DATA, as implemented now, doesn't do anything
useful, because most user-space libraries use the mmap() syscall for
dynamic memory allocations.
      
      Linus suggested to convert RLIMIT_DATA rlimit into something suitable
      for anonymous memory accounting.  But in this patch we go further, and
      the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
This way the code looks cleaner: code/stack/data classification now
depends only on the vm_flags state (helpers sketched below):
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but that might be strange beast like readonly-private or VM_IO
      area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
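
A sketch of the classification helpers this implies (close to the
mm/internal.h helpers the patch introduces):

	static inline bool is_exec_mapping(vm_flags_t flags)
	{
		return (flags & (VM_EXEC | VM_WRITE | VM_STACK)) == VM_EXEC;
	}

	static inline bool is_stack_mapping(vm_flags_t flags)
	{
		return (flags & VM_STACK) == VM_STACK;
	}

	static inline bool is_data_mapping(vm_flags_t flags)
	{
		return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
	}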
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: allow GFP_{FS,IO} for page_cache_read page cache allocation · c20cd45e
Committed by Michal Hocko
page_cache_read has historically been using page_cache_alloc_cold to
allocate a new page.  This means that mapping_gfp_mask is used as the
base for the gfp_mask.  Many filesystems set this mask to GFP_NOFS to
prevent fs recursion issues.  page_cache_read is called
      from the vm_operations_struct::fault() context during the page fault.
      This context doesn't need the reclaim protection normally.
      
      ceph and ocfs2 which call filemap_fault from their fault handlers seem
      to be OK because they are not taking any fs lock before invoking generic
      implementation.  xfs which takes XFS_MMAPLOCK_SHARED is safe from the
      reclaim recursion POV because this lock serializes truncate and punch
      hole with the page faults and it doesn't get involved in the reclaim.
      
There is simply no reason to deliberately use a weaker allocation
context when __GFP_FS | __GFP_IO can be used.  The GFP_NOFS protection
might even be harmful: there is a push to fail GFP_NOFS allocations
rather than loop within the allocator indefinitely with a very limited
reclaim ability.  Once we start failing those requests, the OOM killer
might be triggered prematurely, because the page cache allocation failure
is propagated up the page fault path and ends up in
pagefault_out_of_memory.
      
We cannot play with mapping_gfp_mask directly, because that would be racy
wrt. parallel page faults, and it might interfere with other users who
really rely on the NOFS semantics of the stored gfp_mask.  The mask also
belongs to the inode proper, so changing it would even be a layering
violation.  What we can do instead is to push the gfp_mask into struct
vm_fault and allow the fs layer to overwrite it, should the callback need
to be called with a different allocation context.
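
A sketch of the struct change (unrelated fields elided):

	struct vm_fault {
		unsigned int flags;		/* FAULT_FLAG_xxx flags */
		gfp_t gfp_mask;			/* gfp mask for allocations;
						 * fs may override in ->fault() */
		pgoff_t pgoff;			/* logical offset in the vma */
		void __user *virtual_address;	/* faulting virtual address */
	};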
      
Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO),
because this should normally be safe from the page fault path.  Why do
we care about mapping_gfp_mask at all, then?  Because it doesn't hold
only reclaim protection flags; it might also contain zone and
movability restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we
have to respect those.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259
Committed by Daniel Cashman
      Address Space Layout Randomization (ASLR) provides a barrier to
      exploitation of user-space processes in the presence of security
      vulnerabilities by making it more difficult to find desired code/data
      which could help an attack.  This is done by adding a random offset to
      the location of regions in the process address space, with a greater
      range of potential offset values corresponding to better protection/a
      larger search-space for brute force, but also to greater potential for
      fragmentation.
      
      The offset added to the mmap_base address, which provides the basis for
      the majority of the mappings for a process, is set once on process exec
      in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
      which reflect, hopefully, the best compromise for all systems.  The
      trade-off between increased entropy in the offset value generation and
      the corresponding increased variability in address space fragmentation
      is not absolute, however, and some platforms may tolerate higher amounts
      of entropy.  This patch introduces both new Kconfig values and a sysctl
      interface which may be used to change the amount of entropy used for
      offset generation on a system.
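
For example, on a kernel and architecture with the new options enabled,
the entropy can be raised at runtime (value illustrative):

	# mmap_base randomization entropy, in bits (range is arch-dependent)
	echo 16 > /proc/sys/vm/mmap_rnd_bits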
      
      The direct motivation for this change was in response to the
      libstagefright vulnerabilities that affected Android, specifically to
      information provided by Google's project zero at:
      
        http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
      
      The attack presented therein, by Google's project zero, specifically
      targeted the limited randomness used to generate the offset added to the
      mmap_base address in order to craft a brute-force-based attack.
      Concretely, the attack was against the mediaserver process, which was
      limited to respawning every 5 seconds, on an arm device.  The hard-coded
      8 bits used resulted in an average expected success rate of defeating
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds
apiece).  With this patch, and an accompanying increase in the entropy
      value to 16 bits, the same attack would take an average expected time of
      over 45 hours (32768 tries), which makes it both less feasible and more
      likely to be noticed.
      
      The introduced Kconfig and sysctl options are limited by per-arch
      minimum and maximum values, the minimum of which was chosen to match the
      current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally 3-4 bits less than the number of bits in the
user-space accessible virtual address space.
      
When deciding whether or not to change the default value, a system
      developer should consider that mmap_base address could be placed
      anywhere up to 2^(value) bits away from the non-randomized location,
      which would introduce variable-sized areas above and below the mmap_base
      address such that the maximum vm_area_struct size may be reduced,
      preventing very large allocations.
      
      This patch (of 4):
      
      ASLR only uses as few as 8 bits to generate the random offset for the
      mmap base address on 32 bit architectures.  This value was chosen to
      prevent a poorly chosen value from dividing the address space in such a
      way as to prevent large allocations.  This may not be an issue on all
      platforms.  Allow the specification of a minimum number of bits so that
      platforms desiring greater ASLR protection may determine where to place
      the trade-off.
Signed-off-by: Daniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>