1. 31 10月, 2011 1 次提交
  2. 26 7月, 2011 3 次提交
    • B
      mm/futex: fix futex writes on archs with SW tracking of dirty & young · 2efaca92
      Benjamin Herrenschmidt 提交于
      I haven't reproduced it myself but the fail scenario is that on such
      machines (notably ARM and some embedded powerpc), if you manage to hit
      that futex path on a writable page whose dirty bit has gone from the PTE,
      you'll livelock inside the kernel from what I can tell.
      
      It will go in a loop of trying the atomic access, failing, trying gup to
      "fix it up", getting succcess from gup, go back to the atomic access,
      failing again because dirty wasn't fixed etc...
      
      So I think you essentially hang in the kernel.
      
      The scenario is probably rare'ish because affected architecture are
      embedded and tend to not swap much (if at all) so we probably rarely hit
      the case where dirty is missing or young is missing, but I think Shan has
      a piece of SW that can reliably reproduce it using a shared writable
      mapping & fork or something like that.
      
      On archs who use SW tracking of dirty & young, a page without dirty is
      effectively mapped read-only and a page without young unaccessible in the
      PTE.
      
      Additionally, some architectures might lazily flush the TLB when relaxing
      write protection (by doing only a local flush), and expect a fault to
      invalidate the stale entry if it's still present on another processor.
      
      The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
      "fix it up" by causing get_user_pages() which would then be equivalent to
      taking the fault.
      
      However that isn't the case.  get_user_pages() will not call
      handle_mm_fault() in the case where the PTE seems to have the right
      permissions, regardless of the dirty and young state.  It will eventually
      update those bits ...  in the struct page, but not in the PTE.
      
      Additionally, it will not handle the lazy TLB flushing that can be
      required by some architectures in the fault case.
      
      Basically, gup is the wrong interface for the job.  The patch provides a
      more appropriate one which boils down to just calling handle_mm_fault()
      since what we are trying to do is simulate a real page fault.
      
      The futex code currently attempts to write to user memory within a
      pagefault disabled section, and if that fails, tries to fix it up using
      get_user_pages().
      
      This doesn't work on archs where the dirty and young bits are maintained
      by software, since they will gate access permission in the TLB, and will
      not be updated by gup().
      
      In addition, there's an expectation on some archs that a spurious write
      fault triggers a local TLB flush, and that is missing from the picture as
      well.
      
      I decided that adding those "features" to gup() would be too much for this
      already too complex function, and instead added a new simpler
      fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
      which the futex code can call.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Reported-by: NShan Hai <haishan.bai@gmail.com>
      Tested-by: NShan Hai <haishan.bai@gmail.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Darren Hart <darren.hart@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2efaca92
    • K
      mm: preallocate page before lock_page() at filemap COW · 1d65f86d
      KAMEZAWA Hiroyuki 提交于
      Currently we are keeping faulted page locked throughout whole __do_fault
      call (except for page_mkwrite code path) after calling file system's fault
      code.  If we do early COW, we allocate a new page which has to be charged
      for a memcg (mem_cgroup_newpage_charge).
      
      This function, however, might block for unbounded amount of time if memcg
      oom killer is disabled or fork-bomb is running because the only way out of
      the OOM situation is either an external event or OOM-situation fix.
      
      In the end we are keeping the faulted page locked and blocking other
      processes from faulting it in which is not good at all because we are
      basically punishing potentially an unrelated process for OOM condition in
      a different group (I have seen stuck system because of ld-2.11.1.so being
      locked).
      
      We can do test easily.
      
       % cgcreate -g memory:A
       % cgset -r memory.limit_in_bytes=64M A
       % cgset -r memory.memsw.limit_in_bytes=64M A
       % cd kernel_dir; cgexec -g memory:A make -j
      
      Then, the whole system will live-locked until you kill 'make -j'
      by hands (or push reboot...) This is because some important page in a
      a shared library are locked.
      
      Considering again, the new page is not necessary to be allocated
      with lock_page() held. And usual page allocation may dive into
      long memory reclaim loop with holding lock_page() and can cause
      very long latency.
      
      There are 3 ways.
        1. do allocation/charge before lock_page()
           Pros. - simple and can handle page allocation in the same manner.
                   This will reduce holding time of lock_page() in general.
           Cons. - we do page allocation even if ->fault() returns error.
      
        2. do charge after unlock_page(). Even if charge fails, it's just OOM.
           Pros. - no impact to non-memcg path.
           Cons. - implemenation requires special cares of LRU and we need to modify
                   page_add_new_anon_rmap()...
      
        3. do unlock->charge->lock again method.
           Pros. - no impact to non-memcg path.
           Cons. - This may kill LOCK_PAGE_RETRY optimization. We need to release
                   lock and get it again...
      
      This patch moves "charge" and memory allocation for COW page
      before lock_page(). Then, we can avoid scanning LRU with holding
      a lock on a page and latency under lock_page() will be reduced.
      
      Then, above livelock disappears.
      
      [akpm@linux-foundation.org: fix code layout]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reported-by: NLutz Vieweg <lvml@5t9.de>
      Original-idea-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d65f86d
    • A
      mm/memory.c: remove ZAP_BLOCK_SIZE · 6ac47520
      Andrew Morton 提交于
      ZAP_BLOCK_SIZE became unused in the preemptible-mmu_gather work ("mm:
      Remove i_mmap_lock lockbreak").  So zap it.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ac47520
  3. 09 7月, 2011 1 次提交
  4. 28 6月, 2011 1 次提交
  5. 16 6月, 2011 2 次提交
    • S
      mm: fix wrong kunmap_atomic() pointer · 5f1a1907
      Steven Rostedt 提交于
      Running a ktest.pl test, I hit the following bug on x86_32:
      
        ------------[ cut here ]------------
        WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
         Hardware name:
        Modules linked in:
        Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
        Call Trace:
         [<c04450da>] warn_slowpath_common+0x7c/0x91
         [<c042f5df>] ? __kunmap_atomic+0x64/0xc1
         [<c042f5df>] ? __kunmap_atomic+0x64/0xc1^M
         [<c0445111>] warn_slowpath_null+0x22/0x24
         [<c042f5df>] __kunmap_atomic+0x64/0xc1
         [<c04d4a22>] unmap_vmas+0x43a/0x4e0
         [<c04d9065>] exit_mmap+0x91/0xd2
         [<c0443057>] mmput+0x43/0xad
         [<c0448358>] exit_mm+0x111/0x119
         [<c044855f>] do_exit+0x1ff/0x5fa
         [<c0454ea2>] ? set_current_blocked+0x3c/0x40
         [<c0454f24>] ? sigprocmask+0x7e/0x8e
         [<c0448b55>] do_group_exit+0x65/0x88
         [<c0448b90>] sys_exit_group+0x18/0x1c
         [<c0c3915f>] sysenter_do_call+0x12/0x38
        ---[ end trace 8055f74ea3c0eb62 ]---
      
      Running a ktest.pl git bisect, found the culprit: commit e303297e
      ("mm: extended batches for generic mmu_gather")
      
      But although this was the commit triggering the bug, it was not the one
      originally responsible for the bug.  That was commit d16dfc55 ("mm:
      mmu_gather rework").
      
      The code in zap_pte_range() has something that looks like the following:
      
      	pte =  pte_offset_map_lock(mm, pmd, addr, &ptl);
      	do {
      		[...]
      	} while (pte++, addr += PAGE_SIZE, addr != end);
      	pte_unmap_unlock(pte - 1, ptl);
      
      The pte starts off pointing at the first element in the page table
      directory that was returned by the pte_offset_map_lock().  When it's done
      with the page, pte will be pointing to anything between the next entry and
      the first entry of the next page inclusive.  By doing a pte - 1, this puts
      the pte back onto the original page, which is all that pte_unmap_unlock()
      needs.
      
      In most archs (64 bit), this is not an issue as the pte is ignored in the
      pte_unmap_unlock().  But on 32 bit archs, where things may be kmapped, it
      is essential that the pte passed to pte_unmap_unlock() resides on the same
      page that was given by pte_offest_map_lock().
      
      The problem came in d16dfc55 ("mm: mmu_gather rework") where it introduced
      a "break;" from the while loop.  This alone did not seem to easily trigger
      the bug.  But the modifications made by e303297e caused that "break;" to
      be hit on the first iteration, before the pte++.
      
      The pte not being incremented will now cause pte_unmap_unlock(pte - 1) to
      be pointing to the previous page.  This will cause the wrong page to be
      unmapped, and also trigger the warning above.
      
      The simple solution is to just save the pointer given by
      pte_offset_map_lock() and use it in the unlock.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f1a1907
    • R
      mm/memory.c: fix kernel-doc notation · 0164f69d
      Randy Dunlap 提交于
      Fix new kernel-doc warnings in mm/memory.c:
      
        Warning(mm/memory.c:1327): No description found for parameter 'tlb'
        Warning(mm/memory.c:1327): Excess function parameter 'tlbp' description in 'unmap_vmas'
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0164f69d
  6. 27 5月, 2011 2 次提交
    • Y
      memcg: add the pagefault count into memcg stats · 456f998e
      Ying Han 提交于
      Two new stats in per-memcg memory.stat which tracks the number of page
      faults and number of major page faults.
      
        "pgfault"
        "pgmajfault"
      
      They are different from "pgpgin"/"pgpgout" stat which count number of
      pages charged/discharged to the cgroup and have no meaning of reading/
      writing page to disk.
      
      It is valuable to track the two stats for both measuring application's
      performance as well as the efficiency of the kernel page reclaim path.
      Counting pagefaults per process is useful, but we also need the aggregated
      value since processes are monitored and controlled in cgroup basis in
      memcg.
      
      Functional test: check the total number of pgfault/pgmajfault of all
      memcgs and compare with global vmstat value:
      
        $ cat /proc/vmstat | grep fault
        pgfault 1070751
        pgmajfault 553
      
        $ cat /dev/cgroup/memory.stat | grep fault
        pgfault 1071138
        pgmajfault 553
        total_pgfault 1071142
        total_pgmajfault 553
      
        $ cat /dev/cgroup/A/memory.stat | grep fault
        pgfault 199
        pgmajfault 0
        total_pgfault 199
        total_pgmajfault 0
      
      Performance test: run page fault test(pft) wit 16 thread on faulting in
      15G anon pages in 16G container.  There is no regression noticed on the
      "flt/cpu/s"
      
      Sample output from pft:
      
        TAG pft:anon-sys-default:
          Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
          15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260
      
        +-------------------------------------------------------------------------+
            N           Min           Max        Median           Avg        Stddev
        x  10     16682.962     17344.027     16913.524     16928.812      166.5362
        +  10     16695.568     16923.896     16820.604     16824.652     84.816568
        No difference proven at 95.0% confidence
      
      [akpm@linux-foundation.org: fix build]
      [hughd@google.com: shmem fix]
      Signed-off-by: NYing Han <yinghan@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      456f998e
    • K
      mm: don't access vm_flags as 'int' · ca16d140
      KOSAKI Motohiro 提交于
      The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
      'unsigned int'. This patch fixes such misuse.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      [ Changed to use a typedef - we'll extend it to cover more cases
        later, since there has been discussion about making it a 64-bit
        type..                      - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ca16d140
  7. 25 5月, 2011 7 次提交
    • P
      mm: uninline large generic tlb.h functions · 9547d01b
      Peter Zijlstra 提交于
      Some of these functions have grown beyond inline sanity, move them
      out-of-line.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Requested-by: NAndrew Morton <akpm@linux-foundation.org>
      Requested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9547d01b
    • P
      mm: Convert i_mmap_lock to a mutex · 3d48ae45
      Peter Zijlstra 提交于
      Straightforward conversion of i_mmap_lock to a mutex.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d48ae45
    • P
      mm: Remove i_mmap_lock lockbreak · 97a89413
      Peter Zijlstra 提交于
      Hugh says:
       "The only significant loser, I think, would be page reclaim (when
        concurrent with truncation): could spin for a long time waiting for
        the i_mmap_mutex it expects would soon be dropped? "
      
      Counter points:
       - cpu contention makes the spin stop (need_resched())
       - zap pages should be freeing pages at a higher rate than reclaim
         ever can
      
      I think the simplification of the truncate code is definitely worth it.
      
      Effectively reverts: 2aa15890 ("mm: prevent concurrent
      unmap_mapping_range() on the same inode") and takes out the code that
      caused its problem.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97a89413
    • P
      mm: extended batches for generic mmu_gather · e303297e
      Peter Zijlstra 提交于
      Instead of using a single batch (the small on-stack, or an allocated
      page), try and extend the batch every time it runs out and only flush once
      either the extend fails or we're done.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Requested-by: NNick Piggin <npiggin@kernel.dk>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e303297e
    • P
      mm, powerpc: move the RCU page-table freeing into generic code · 26723911
      Peter Zijlstra 提交于
      In case other architectures require RCU freed page-tables to implement
      gup_fast() and software filled hashes and similar things, provide the
      means to do so by moving the logic into generic code.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Requested-by: NDavid Miller <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26723911
    • P
      mm: mmu_gather rework · d16dfc55
      Peter Zijlstra 提交于
      Rework the existing mmu_gather infrastructure.
      
      The direct purpose of these patches was to allow preemptible mmu_gather,
      but even without that I think these patches provide an improvement to the
      status quo.
      
      The first 9 patches rework the mmu_gather infrastructure.  For review
      purpose I've split them into generic and per-arch patches with the last of
      those a generic cleanup.
      
      The next patch provides generic RCU page-table freeing, and the followup
      is a patch converting s390 to use this.  I've also got 4 patches from
      DaveM lined up (not included in this series) that uses this to implement
      gup_fast() for sparc64.
      
      Then there is one patch that extends the generic mmu_gather batching.
      
      After that follow the mm preemptibility patches, these make part of the mm
      a lot more preemptible.  It converts i_mmap_lock and anon_vma->lock to
      mutexes which together with the mmu_gather rework makes mmu_gather
      preemptible as well.
      
      Making i_mmap_lock a mutex also enables a clean-up of the truncate code.
      
      This also allows for preemptible mmu_notifiers, something that XPMEM I
      think wants.
      
      Furthermore, it removes the new and universially detested unmap_mutex.
      
      This patch:
      
      Remove the first obstacle towards a fully preemptible mmu_gather.
      
      The current scheme assumes mmu_gather is always done with preemption
      disabled and uses per-cpu storage for the page batches.  Change this to
      try and allocate a page for batching and in case of failure, use a small
      on-stack array to make some progress.
      
      Preemptible mmu_gather is desired in general and usable once i_mmap_lock
      becomes a mutex.  Doing it before the mutex conversion saves us from
      having to rework the code by moving the mmu_gather bits inside the
      pte_lock.
      
      Also avoid flushing the tlb batches from under the pte lock, this is
      useful even without the i_mmap_lock conversion as it significantly reduces
      pte lock hold times.
      
      [akpm@linux-foundation.org: fix comment tpyo]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d16dfc55
    • M
      mm: make expand_downwards() symmetrical with expand_upwards() · d05f3169
      Michal Hocko 提交于
      Currently we have expand_upwards exported while expand_downwards is
      accessible only via expand_stack or expand_stack_downwards.
      
      check_stack_guard_page is a nice example of the asymmetry.  It uses
      expand_stack for VM_GROWSDOWN while expand_upwards is called for
      VM_GROWSUP case.
      
      Let's clean this up by exporting both functions and make those names
      consistent.  Let's use expand_{upwards,downwards} because expanding
      doesn't always involve stack manipulation (an example is
      ia64_do_page_fault which uses expand_upwards for registers backing store
      expansion).  expand_downwards has to be defined for both
      CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
      version in the early process initialization phase for growsup
      configuration.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d05f3169
  8. 10 5月, 2011 1 次提交
    • M
      Don't lock guardpage if the stack is growing up · a09a79f6
      Mikulas Patocka 提交于
      Linux kernel excludes guard page when performing mlock on a VMA with
      down-growing stack. However, some architectures have up-growing stack
      and locking the guard page should be excluded in this case too.
      
      This patch fixes lvm2 on PA-RISC (and possibly other architectures with
      up-growing stack). lvm2 calculates number of used pages when locking and
      when unlocking and reports an internal error if the numbers mismatch.
      
      [ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
        grows-up case, and to move things around a bit to clean it all up and
        share the infrstructure with the /proc bits.
      
        Tested on ia64 that has both grow-up and grow-down segments  - Linus ]
      Signed-off-by: NMikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Tested-by: NTony Luck <tony.luck@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a09a79f6
  9. 05 5月, 2011 1 次提交
    • L
      VM: skip the stack guard page lookup in get_user_pages only for mlock · a1fde08c
      Linus Torvalds 提交于
      The logic in __get_user_pages() used to skip the stack guard page lookup
      whenever the caller wasn't interested in seeing what the actual page
      was.  But Michel Lespinasse points out that there are cases where we
      don't care about the physical page itself (so 'pages' may be NULL), but
      do want to make sure a page is mapped into the virtual address space.
      
      So using the existence of the "pages" array as an indication of whether
      to look up the guard page or not isn't actually so great, and we really
      should just use the FOLL_MLOCK bit.  But because that bit was only set
      for the VM_LOCKED case (and not all vma's necessarily have it, even for
      mlock()), we couldn't do that originally.
      
      Fix that by moving the VM_LOCKED check deeper into the call-chain, which
      actually simplifies many things.  Now mlock() gets simpler, and we can
      also check for FOLL_MLOCK in __get_user_pages() and the code ends up
      much more straightforward.
      Reported-and-reviewed-by: NMichel Lespinasse <walken@google.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1fde08c
  10. 29 4月, 2011 1 次提交
    • M
      mm: check if PTE is already allocated during page fault · cc03638d
      Mel Gorman 提交于
      With transparent hugepage support, handle_mm_fault() has to be careful
      that a normal PMD has been established before handling a PTE fault.  To
      achieve this, it used __pte_alloc() directly instead of pte_alloc_map as
      pte_alloc_map is unsafe to run against a huge PMD.  pte_offset_map() is
      called once it is known the PMD is safe.
      
      pte_alloc_map() is smart enough to check if a PTE is already present
      before calling __pte_alloc but this check was lost.  As a consequence,
      PTEs may be allocated unnecessarily and the page table lock taken.  Thi
      useless PTE does get cleaned up but it's a performance hit which is
      visible in page_test from aim9.
      
      This patch simply re-adds the check normally done by pte_alloc_map to
      check if the PTE needs to be allocated before taking the page table lock.
      The effect is noticable in page_test from aim9.
      
        AIM9
                        2.6.38-vanilla 2.6.38-checkptenone
        creat-clo      446.10 ( 0.00%)   424.47 (-5.10%)
        page_test       38.10 ( 0.00%)    42.04 ( 9.37%)
        brk_test        52.45 ( 0.00%)    51.57 (-1.71%)
        exec_test      382.00 ( 0.00%)   456.90 (16.39%)
        fork_test       60.11 ( 0.00%)    67.79 (11.34%)
        MMTests Statistics: duration
        Total Elapsed Time (seconds)                611.90    612.22
      
      (While this affects 2.6.38, it is a performance rather than a functional
      bug and normally outside the rules -stable.  While the big performance
      differences are to a microbench, the difference in fork and exec
      performance may be significant enough that -stable wants to consider the
      patch)
      Reported-by: NRaz Ben Yehuda <raziebe@gmail.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc03638d
  11. 15 4月, 2011 1 次提交
    • M
      mm: check that we have the right vma in __access_remote_vm() · fe936dfc
      Michael Ellerman 提交于
      In __access_remote_vm() we need to check that we have found the right
      vma, not the following vma before we try to access it.  Otherwise we
      might call the vma's access routine with an address which does not fall
      inside the vma.
      
      It was discovered on a current kernel but with an unreleased driver,
      from memory it was strace leading to a kernel bad access, but it
      obviously depends on what the access implementation does.
      
      Looking at other access implementations I only see:
      
        $ git grep -A 5 vm_operations|grep access
        arch/powerpc/platforms/cell/spufs/file.c-	.access = spufs_mem_mmap_access,
        arch/x86/pci/i386.c-	.access = generic_access_phys,
        drivers/char/mem.c-	.access = generic_access_phys
        fs/sysfs/bin.c-	.access		= bin_access,
      
      The spufs one looks like it might behave badly given the wrong vma, it
      assumes vma->vm_file->private_data is a spu_context, and looks like it
      would probably blow up pretty quickly if it wasn't.
      
      generic_access_phys() only uses the vma to check vm_flags and get the
      mm, and then walks page tables using the address.  So it should bail on
      the vm_flags check, or at worst let you access some other VM_IO mapping.
      
      And bin_access() just proxies to another access implementation.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe936dfc
  12. 13 4月, 2011 1 次提交
    • L
      vm: fix mlock() on stack guard page · 95042f9e
      Linus Torvalds 提交于
      Commit 53a7706d ("mlock: do not hold mmap_sem for extended periods
      of time") changed mlock() to care about the exact number of pages that
      __get_user_pages() had brought it.  Before, it would only care about
      errors.
      
      And that doesn't work, because we also handled one page specially in
      __mlock_vma_pages_range(), namely the stack guard page.  So when that
      case was handled, the number of pages that the function returned was off
      by one.  In particular, it could be zero, and then the caller would end
      up not making any progress at all.
      
      Rather than try to fix up that off-by-one error for the mlock case
      specially, this just moves the logic to handle the stack guard page
      into__get_user_pages() itself, thus making all the counts come out
      right automatically.
      Reported-by: NRobert Święcki <robert@swiecki.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95042f9e
  13. 28 3月, 2011 1 次提交
  14. 24 3月, 2011 7 次提交
  15. 23 3月, 2011 1 次提交
    • G
      mm: allow GUP to fail instead of waiting on a page · 318b275f
      Gleb Natapov 提交于
      GUP user may want to try to acquire a reference to a page if it is already
      in memory, but not if IO, to bring it in, is needed.  For example KVM may
      tell vcpu to schedule another guest process if current one is trying to
      access swapped out page.  Meanwhile, the page will be swapped in and the
      guest process, that depends on it, will be able to run again.
      
      This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
      FOLL_NOWAIT follow_page flags.  FAULT_FLAG_RETRY_NOWAIT, when used in
      conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
      it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
      instead.
      
      [akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      318b275f
  16. 18 3月, 2011 2 次提交
    • H
      mm: make __get_user_pages return -EHWPOISON for HWPOISON page optionally · 69ebb83e
      Huang Ying 提交于
      Make __get_user_pages return -EHWPOISON for HWPOISON page only if
      FOLL_HWPOISON is specified.  With this patch, the interested callers
      can distinguish HWPOISON pages from general FAULT pages, while other
      callers will still get -EFAULT for all these pages, so the user space
      interface need not to be changed.
      
      This feature is needed by KVM, where UCR MCE should be relayed to
      guest for HWPOISON page, while instruction emulation and MMIO will be
      tried for general FAULT page.
      
      The idea comes from Andrew Morton.
      Signed-off-by: NHuang Ying <ying.huang@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      69ebb83e
    • H
      mm: export __get_user_pages · 0014bd99
      Huang Ying 提交于
      In most cases, get_user_pages and get_user_pages_fast should be used
      to pin user pages in memory.  But sometimes, some special flags except
      FOLL_GET, FOLL_WRITE and FOLL_FORCE are needed, for example in
      following patch, KVM needs FOLL_HWPOISON.  To support these users,
      __get_user_pages is exported directly.
      
      There are some symbol name conflicts in infiniband driver, fixed them too.
      Signed-off-by: NHuang Ying <ying.huang@intel.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Michel Lespinasse <walken@google.com>
      CC: Roland Dreier <roland@kernel.org>
      CC: Ralph Campbell <infinipath@qlogic.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      0014bd99
  17. 24 2月, 2011 1 次提交
    • M
      mm: prevent concurrent unmap_mapping_range() on the same inode · 2aa15890
      Miklos Szeredi 提交于
      Michael Leun reported that running parallel opens on a fuse filesystem
      can trigger a "kernel BUG at mm/truncate.c:475"
      
      Gurudas Pai reported the same bug on NFS.
      
      The reason is, unmap_mapping_range() is not prepared for more than
      one concurrent invocation per inode.  For example:
      
        thread1: going through a big range, stops in the middle of a vma and
           stores the restart address in vm_truncate_count.
      
        thread2: comes in with a small (e.g. single page) unmap request on
           the same vma, somewhere before restart_address, finds that the
           vma was already unmapped up to the restart address and happily
           returns without doing anything.
      
      Another scenario would be two big unmap requests, both having to
      restart the unmapping and each one setting vm_truncate_count to its
      own value.  This could go on forever without any of them being able to
      finish.
      
      Truncate and hole punching already serialize with i_mutex.  Other
      callers of unmap_mapping_range() do not, and it's difficult to get
      i_mutex protection for all callers.  In particular ->d_revalidate(),
      which calls invalidate_inode_pages2_range() in fuse, may be called
      with or without i_mutex.
      
      This patch adds a new mutex to 'struct address_space' to prevent
      running multiple concurrent unmap_mapping_range() on the same mapping.
      
      [ We'll hopefully get rid of all this with the upcoming mm
        preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
        lockbreak" patch in particular.  But that is for 2.6.39 ]
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Reported-by: NMichael Leun <lkml20101129@newton.leun.net>
      Reported-by: NGurudas Pai <gurudas.pai@oracle.com>
      Tested-by: NGurudas Pai <gurudas.pai@oracle.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2aa15890
  18. 17 2月, 2011 1 次提交
  19. 12 2月, 2011 2 次提交
  20. 14 1月, 2011 3 次提交