1. 05 April 2014 (1 commit)
    • mm: get_user_pages(write,force) refuse to COW in shared areas · cda540ac
      Authored by Hugh Dickins
      get_user_pages(write=1, force=1) has always had odd behaviour on write-
      protected shared mappings: although it demands FMODE_WRITE-access to the
      underlying object (do_mmap_pgoff sets neither VM_SHARED nor VM_MAYWRITE
      without that), it ends up with do_wp_page substituting private anonymous
      Copied-On-Write pages for the shared file pages in the area.
      
      That was long ago intentional, as a safety measure to prevent ptrace
      setting a breakpoint (or POKETEXT or POKEDATA) from inadvertently
      corrupting the underlying executable.  Yet exec and dynamic loaders open
      the file read-only, and use MAP_PRIVATE rather than MAP_SHARED.
      
      The traditional odd behaviour still causes surprises and bugs in mm, and
      is probably not what any caller wants - even the comment on the flag
      says "You do not want this" (although it's undoubtedly necessary for
      overriding userspace protections in some contexts, and good when !write).
      
      Let's stop doing that.  But it would be dangerous to remove the long-
      standing safety at this stage, so just make get_user_pages(write,force)
      fail with EFAULT when applied to a write-protected shared area.
      Infiniband may in future want to force write through to underlying
      object: we can add another FOLL_flag later to enable that if required.
      
      Odd though the old behaviour was, there is no doubt that we may turn out
      to break userspace with this change, and have to revert it quickly.
      Issue a WARN_ON_ONCE to help debug the changed case (easily triggered by
      userspace, so only once to prevent spamming the logs); and delay a few
      associated cleanups until this change is proved.
      
      get_user_pages callers who might see trouble from this change:
        ptrace poking, or writing to /proc/<pid>/mem
        drivers/infiniband/
        drivers/media/v4l2-core/
        drivers/gpu/drm/exynos/exynos_drm_gem.c
        drivers/staging/tidspbridge/core/tiomap3430.c
      if they ever apply get_user_pages to write-protected shared mappings
      of an object which was opened for writing.
      
      I went to apply the same change to mm/nommu.c, but retreated.  NOMMU has
      no place for COW, and its VM_flags conventions are not the same: I'd be
      more likely to screw up NOMMU than make an improvement there.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
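
      A minimal sketch of the new refusal, assuming the check lives in the
      vma-flag validation done by __get_user_pages(); the helper name and its
      exact placement are illustrative, not the verbatim patch:

        #include <linux/mm.h>

        /*
         * Illustrative only: with FOLL_WRITE|FOLL_FORCE, a write-protected
         * VM_SHARED mapping is now rejected with -EFAULT instead of being
         * silently populated with private COW copies; a read-only private
         * mapping (VM_MAYWRITE set, VM_SHARED clear) still takes the old
         * COW path so ptrace breakpoints keep working.
         */
        static int check_force_write(struct vm_area_struct *vma,
                                     unsigned int gup_flags)
        {
            vm_flags_t vm_flags = vma->vm_flags;

            if (!(gup_flags & FOLL_WRITE) || (vm_flags & VM_WRITE))
                return 0;
            if (!(gup_flags & FOLL_FORCE))
                return -EFAULT;
            if ((vm_flags & (VM_MAYWRITE | VM_SHARED)) != VM_MAYWRITE) {
                WARN_ON_ONCE(1);    /* behaviour changed here: warn once */
                return -EFAULT;
            }
            return 0;
        }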
  2. 04 April 2014 (8 commits)
  3. 26 February 2014 (2 commits)
  4. 24 January 2014 (2 commits)
    • mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      Authored by Sasha Levin
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      Based on recent requests to add a small piece of code that dumps the
      page at various VM_BUG_ON sites, I've noticed that the page dump is
      quite useful to people debugging issues in mm.
      
      This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
      VM_BUG_ON() does, also dumps the page before executing the actual
      BUG_ON.
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
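
      A minimal sketch of such a macro under CONFIG_DEBUG_VM; the exact
      expansion is illustrative (dump_page() is mm's existing page-dumping
      helper, shown here with the reason string discussed in the next entry):

        #include <linux/bug.h>
        #include <linux/mmdebug.h>

        /*
         * Illustrative sketch: behave like VM_BUG_ON(), but dump the
         * offending page (flags, mapping, index, counts) before the BUG(),
         * so the oops report carries the state people keep asking for.
         */
        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON_PAGE(cond, page)                                \
            do {                                                          \
                if (unlikely(cond)) {                                     \
                    dump_page(page, "VM_BUG_ON_PAGE(" #cond ")");         \
                    BUG();                                                \
                }                                                         \
            } while (0)
        #else
        #define VM_BUG_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
        #endif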
    • mm: print more details for bad_page() · f0b791a3
      Authored by Dave Hansen
      bad_page() is cool in that it prints out a bunch of data about the page.
      But, I can never remember which page flags are good and which are bad,
      or whether ->index or ->mapping is required to be NULL.
      
      This patch allows bad/dump_page() callers to specify a string about why
      they are dumping the page and adds explanation strings to a number of
      places.  It also adds a 'bad_flags' argument to bad_page(), which it
      then dumps out separately from the flags which are actually set.
      
      This way, the messages will show specifically why the page was bad and,
      when the problem was a bad page-flag combination, exactly which flags it
      is complaining about.
      
      [akpm@linux-foundation.org: switch to pr_alert]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
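
      A hedged sketch of the resulting call shape; the caller below and the
      specific flag mask are illustrative, and in the tree bad_page() itself
      is local to mm/page_alloc.c:

        #include <linux/mm.h>
        #include <linux/page-flags.h>

        /* reporting helper (local to mm/page_alloc.c in the tree) */
        static void bad_page(struct page *page, const char *reason,
                             unsigned long bad_flags);

        /*
         * Illustrative caller: a page reaching the allocator with flags that
         * should already have been cleared is reported with both a reason
         * string and the specific offending bits, so the log says why the
         * page was bad and which flags triggered the complaint.
         */
        static void example_check_new_page(struct page *page)
        {
            const unsigned long bad = PAGE_FLAGS_CHECK_AT_PREP;

            if (unlikely(page->flags & bad))
                bad_page(page, "PAGE_FLAGS_CHECK_AT_PREP flag set",
                         page->flags & bad);
        }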
  5. 22 January 2014 (2 commits)
    • mm: create a separate slab for page->ptl allocation · b35f1819
      Authored by Kirill A. Shutemov
      If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on x86_64
      is 72 bytes.  For page->ptl they will be allocated from the kmalloc-96
      slab, so we lose 24 bytes on each.  An average system can easily allocate
      a few tens of thousands of page->ptl and the overhead is significant.
      
      Let's create a separate slab for page->ptl allocation to solve this.
      
      To make sure that it really works this time, some numbers from my test
      machine (just booted, no load):
      
      Before:
        # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
        kmalloc-96         31987  32190    128   30    1 : tunables  120   60    8 : slabdata   1073   1073     92
      After:
        # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
        page->ptl          27516  28143     72   53    1 : tunables  120   60    8 : slabdata    531    531      9
        kmalloc-96          3853   5280    128   30    1 : tunables  120   60    8 : slabdata    176    176      0
      
      Note that the patch is useful not only for debug case, but also for
      PREEMPT_RT, where spinlock_t is always bloated.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
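
      A minimal sketch of the dedicated cache, assuming a split-ptlock
      configuration where page->ptl is a pointer; the helper names are
      illustrative:

        #include <linux/init.h>
        #include <linux/mm.h>
        #include <linux/slab.h>
        #include <linux/spinlock.h>

        static struct kmem_cache *page_ptl_cachep;

        /* A cache sized exactly for spinlock_t: no kmalloc-96 rounding waste. */
        void __init ptlock_cache_init(void)
        {
            page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t),
                                                0, SLAB_PANIC, NULL);
        }

        bool ptlock_alloc(struct page *page)
        {
            spinlock_t *ptl;

            ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
            if (!ptl)
                return false;
            page->ptl = ptl;
            return true;
        }

        void ptlock_free(struct page *page)
        {
            kmem_cache_free(page_ptl_cachep, page->ptl);
        }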
    • dma-debug: introduce debug_dma_assert_idle() · 0abdd7a8
      Authored by Dan Williams
      Record actively mapped pages and provide an API for asserting that a
      given page is DMA-inactive before execution proceeds.  Placing
      debug_dma_assert_idle() in cow_user_page() flagged the violation of the
      DMA API in the NET_DMA implementation (see commit 77873803 "net_dma:
      mark broken").
      
      The implementation includes the capability to count, in a limited way,
      repeat mappings of the same page that occur without an intervening
      unmap.  This 'overlap' counter is limited to the few bits of tag space
      in a radix tree.  This mechanism is added to mitigate false-negative
      cases where, for example, a page is DMA-mapped twice and
      debug_dma_assert_idle() is called after the page is unmapped once.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
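
      A hedged sketch of the intended usage at a COW copy site; the wrapper
      function is illustrative, and debug_dma_assert_idle() compiles away
      unless CONFIG_DMA_API_DEBUG is enabled:

        #include <linux/dma-debug.h>
        #include <linux/highmem.h>
        #include <linux/mm.h>

        /*
         * Illustrative COW helper: before duplicating the source page into a
         * private copy, assert that no DMA mapping is still active against
         * it; otherwise a device could keep writing to the soon-to-be-stale
         * original while the CPU works on the copy.
         */
        static void example_cow_copy(struct page *dst, struct page *src,
                                     unsigned long addr,
                                     struct vm_area_struct *vma)
        {
            debug_dma_assert_idle(src);
            copy_user_highpage(dst, src, addr, vma);
        }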
  6. 21 December 2013 (2 commits)
  7. 21 November 2013 (1 commit)
  8. 15 November 2013 (5 commits)
  9. 13 November 2013 (1 commit)
  10. 29 October 2013 (1 commit)
    • mm: numa: Sanitize task_numa_fault() callsites · c61109e3
      Authored by Mel Gorman
      There are three callers of task_numa_fault():
      
       - do_huge_pmd_numa_page():
           Accounts against the current node, not the node where the
           page resides, unless we migrated, in which case it accounts
           against the node we migrated to.
      
       - do_numa_page():
           Accounts against the current node, not the node where the
           page resides, unless we migrated, in which case it accounts
           against the node we migrated to.
      
       - do_pmd_numa_page():
           Accounts not at all when the page isn't migrated, otherwise
           accounts against the node we migrated towards.
      
      This seems wrong to me; all three sites should have the same
      semantics.  Furthermore, we should account against where the page
      really is; we already know where the task is.
      
      So modify all three sites to always account (we did, after all, receive
      the fault), and to always account to where the page is after migration,
      regardless of success.
      
      They all still differ on when they clear the PTE/PMD; ideally that
      would get sorted too.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
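
      A hedged sketch of the unified accounting shape, using the
      three-argument task_numa_fault(node, pages, migrated) form of that era;
      the wrapper and its parameters are illustrative:

        #include <linux/mm.h>
        #include <linux/sched.h>

        /*
         * Illustrative only: after a NUMA hinting fault, always account the
         * fault, and account it against the node the page actually sits on
         * once any migration attempt has finished, whether or not the
         * migration succeeded.
         */
        static void example_numa_fault_account(struct page *page, int nr_pages,
                                               bool migrated)
        {
            int page_nid = page_to_nid(page);   /* post-migration location */

            task_numa_fault(page_nid, nr_pages, migrated);
        }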
  11. 25 October 2013 (1 commit)
  12. 17 October 2013 (2 commits)
    • mm: memcg: handle non-error OOM situations more gracefully · 49426420
      Authored by Johannes Weiner
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  As an added bonus, this simplifies the code
      quite a bit.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
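
      A hedged sketch of the deferral pattern in an arch page-fault handler;
      the helper names (mem_cgroup_oom_enable/disable/synchronize) and the
      handle_mm_fault() signature are as recalled from that era and should be
      treated as assumptions:

        #include <linux/memcontrol.h>
        #include <linux/mm.h>

        /*
         * Illustrative fault-handler fragment (assumed API): charges made
         * while handling a userspace fault may set up a memcg OOM context
         * but never invoke the killer directly; only if the fault as a
         * whole comes back VM_FAULT_OOM is the kill actually carried out.
         */
        static int example_user_fault(struct mm_struct *mm,
                                      struct vm_area_struct *vma,
                                      unsigned long address, unsigned int flags)
        {
            int fault;

            mem_cgroup_oom_enable();                 /* assumed helper */
            fault = handle_mm_fault(mm, vma, address, flags);
            mem_cgroup_oom_disable();                /* assumed helper */

            if (fault & VM_FAULT_OOM)
                mem_cgroup_oom_synchronize(true);    /* run the deferred kill */
            return fault;
        }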
    • mm: migration: do not lose soft dirty bit if page is in migration state · c3d16e16
      Authored by Cyrill Gorcunov
      If page migration is enabled in the config and the page is migrating, we
      may lose the soft dirty bit.  If fork or mprotect is called on migrating
      pages, then once migration is complete the pages do not obtain the soft
      dirty bit in the corresponding pte entries.  Fix it by adding an
      appropriate test on swap entries.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
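
      A hedged sketch of the kind of test added when a pte holding a migration
      entry is rewritten (for example while downgrading a write migration
      entry during fork); the helper below is illustrative:

        #include <linux/mm.h>
        #include <linux/swapops.h>

        /*
         * Illustrative only: rebuilding a swap pte from a migration entry
         * starts from a clean pte, so the soft-dirty bit carried by the old
         * swap pte must be copied over explicitly or it is lost.
         */
        static pte_t example_downgrade_migration_entry(pte_t orig_pte)
        {
            swp_entry_t entry = pte_to_swp_entry(orig_pte);
            pte_t newpte = orig_pte;

            if (is_write_migration_entry(entry)) {
                make_migration_entry_read(&entry);
                newpte = swp_entry_to_pte(entry);
                if (pte_swp_soft_dirty(orig_pte))
                    newpte = pte_swp_mksoft_dirty(newpte);
            }
            return newpte;
        }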
  13. 09 October 2013 (11 commits)
  14. 13 September 2013 (1 commit)