1. 21 Jul 2011, 1 commit
• fs: kill i_alloc_sem · bd5fe6c5
  Committed by Christoph Hellwig
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
be released by a non-owner, and its write side is always mirrored by
real exclusion.  Its intended use is to wait for all pending direct I/O
requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
 - exclusion for truncates is already guaranteed by i_mutex, so it can
   simply fall away
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
 - when i_dio_count reaches zero we wake up a pending truncate using
   wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
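
In rough outline, the construct looks like this (a minimal sketch of the
scheme the changelog describes; the helper names and the exact word
holding the wakeup bit are illustrative, not a quote of the patch):

	/* direct I/O side: count an operation in and out */
	static inline void inode_dio_begin(struct inode *inode)
	{
		/* starting a new op requires i_mutex (or equivalent) */
		atomic_inc(&inode->i_dio_count);
	}

	static inline void inode_dio_done(struct inode *inode)
	{
		/* the last pending request wakes a waiting truncate */
		if (atomic_dec_and_test(&inode->i_dio_count))
			wake_up_bit(&inode->i_flags, __I_DIO_WAKEUP);
	}

	/* truncate side: wait until all pending direct I/O has drained */
	static void inode_dio_wait(struct inode *inode)
	{
		wait_queue_head_t *wq = bit_waitqueue(&inode->i_flags,
						      __I_DIO_WAKEUP);
		DEFINE_WAIT_BIT(q, &inode->i_flags, __I_DIO_WAKEUP);

		do {
			prepare_to_wait(wq, &q.wait, TASK_UNINTERRUPTIBLE);
			if (atomic_read(&inode->i_dio_count))
				schedule();
		} while (atomic_read(&inode->i_dio_count));
		finish_wait(wq, &q.wait);
	}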
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2. 20 Jul 2011, 5 commits
• vmscan: add customisable shrinker batch size · e9299f50
  Committed by Dave Chinner
      For shrinkers that have their own cond_resched* calls, having
      shrink_slab break the work down into small batches is not
particularly efficient. Add a custom batch size field to the struct
      shrinker so that shrinkers can use a larger batch size if they
      desire.
      
      A value of zero (uninitialised) means "use the default", so
      behaviour is unchanged by this patch.
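
A sketch of the mechanism (the field layout and the SHRINK_BATCH
fallback follow the shrinker API of this era; treat the details as
illustrative):

	struct shrinker {
		int (*shrink)(struct shrinker *, struct shrink_control *sc);
		int seeks;		/* seeks to recreate an object */
		long batch;		/* reclaim batch size, 0 = default */
		/* internal use: */
		struct list_head list;
		long nr;		/* objects pending deletion */
	};

	/* in shrink_slab(): fall back to the old default when unset */
	long batch_size = shrinker->batch ? shrinker->batch : SHRINK_BATCH;

	while (total_scan >= batch_size) {
		shrink_ret = do_shrinker_shrink(shrinker, shrink, batch_size);
		/* ... accounting ... */
		total_scan -= batch_size;
		cond_resched();
	}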
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
• vmscan: reduce wind up shrinker->nr when shrinker can't do work · 3567b59a
  Committed by Dave Chinner
      When a shrinker returns -1 to shrink_slab() to indicate it cannot do
      any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr.  The idea behind this is that
when the shrinker is next called and can do work, it will do the work
      of the previously aborted shrinker call as well.
      
      However, if a filesystem is doing lots of allocation with GFP_NOFS
      set, then we get many, many more aborts from the shrinkers than we
      do successful calls. The result is that shrinker->nr winds up to
its maximum permissible value (twice the current cache size) and
      then when the next shrinker call that can do work is issued, it
      has enough scan count built up to free the entire cache twice over.
      
      This manifests itself in the cache going from full to empty in a
      matter of seconds, even when only a small part of the cache is
      needed to be emptied to free sufficient memory.
      
      Under metadata intensive workloads on ext4 and XFS, I'm seeing the
      VFS caches increase memory consumption up to 75% of memory (no page
      cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s.  This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so for as long as the workload
continues.
      
      This behaviour was made obvious by the shrink_slab tracepoints added
      earlier in the series, and made worse by the patch that corrected
      the concurrent accounting of shrinker->nr.
      
      To avoid this problem, stop repeated small increments of the total
      scan value from winding shrinker->nr up to a value that can cause
      the entire cache to be freed. We still need to allow it to wind up,
      so use the delta as the "large scan" threshold check - if the delta
      is more than a quarter of the entire cache size, then it is a large
      scan and allowed to cause lots of windup because we are clearly
      needing to free lots of memory.
      
      If it isn't a large scan then limit the total scan to half the size
      of the cache so that windup never increases to consume the whole
      cache. Reducing the total scan limit further does not allow enough
      wind-up to maintain the current levels of performance, whilst a
      higher threshold does not prevent the windup from freeing the entire
      cache under sustained workloads.
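
The resulting clamp is small.  Roughly (a sketch, where max_pass is the
shrinker-reported cache size and delta the newly computed scan count):

	/*
	 * Only a large delta may wind total_scan up to the size of the
	 * whole cache; small deltas (repeated GFP_NOFS aborts) are
	 * clamped to half the cache size.
	 */
	if (delta < max_pass / 4)
		total_scan = min(total_scan, max_pass / 2);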
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
• vmscan: shrinker->nr updates race and go wrong · acf92b48
  Committed by Dave Chinner
      shrink_slab() allows shrinkers to be called in parallel so the
      struct shrinker can be updated concurrently. It does not provide any
exclusion for such updates, so we can get the shrinker->nr value
      increasing or decreasing incorrectly.
      
      As a result, when a shrinker repeatedly returns a value of -1 (e.g.
      a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
      sometimes updating with the scan count that wasn't used, sometimes
      losing it altogether. Worse is when a shrinker does work and that
      update is lost due to racy updates, which means the shrinker will do
      the work again!
      
      Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t. other
updates via cmpxchg loops.
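
In outline, the update pattern becomes (a sketch of the cmpxchg loops;
local variable names are illustrative):

	/*
	 * Copy the scan count into a local variable and zero it so
	 * concurrent shrinker invocations don't consume it as well.
	 */
	do {
		nr = shrinker->nr;
	} while (cmpxchg(&shrinker->nr, nr, 0) != nr);
	total_scan = nr + delta;

	/* ... perform the scanning work against total_scan ... */

	/*
	 * Move the unused scan count back into the shrinker in a
	 * manner that handles concurrent updates.
	 */
	do {
		nr = shrinker->nr;
		new_nr = total_scan + nr;
	} while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);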
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
• vmscan: add shrink_slab tracepoints · 09576073
  Committed by Dave Chinner
      It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add some tracepoints to allow
      insight to be gained.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
• vmscan: fix a livelock in kswapd · 4746efde
  Committed by Shaohua Li
      I'm running a workload which triggers a lot of swap in a machine with 4
      nodes.  After I kill the workload, I found a kswapd livelock.  Sometimes
kswapd3 or kswapd2 keep running and I can't access the filesystem,
      but most memory is free.
      
      This looks like a regression since commit 08951e54 ("mm: vmscan:
      correct check for kswapd sleeping in sleeping_prematurely").
      
      Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
      for classzone_idx.  The reason is end_zone in balance_pgdat() is 0 by
      default, if all zones have watermark ok, end_zone will keep 0.
      
Later, sleeping_prematurely() always returns true, because this is an
order-3 wakeup and, if classzone_idx is 0, both balanced_pages and
present_pages in pgdat_balanced() are 0.  We add a special case here:
if a zone has no pages, we consider it balanced.  This fixes the
livelock.
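
The special case amounts to a one-character change in pgdat_balanced()
(a sketch of the 3.0-era helper; turning > into >= lets the empty case,
0 >= 0, count as balanced):

	static bool pgdat_balanced(pg_data_t *pgdat,
				   unsigned long balanced_pages,
				   int classzone_idx)
	{
		unsigned long present_pages = 0;
		int i;

		for (i = 0; i <= classzone_idx; i++)
			present_pages += pgdat->node_zones[i].present_pages;

		/* special case: a zone with no pages is balanced */
		return balanced_pages >= (present_pages >> 2);
	}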
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3. 09 Jul 2011, 8 commits
• mm/nommu.c: fix remap_pfn_range() · 8f3b1327
  Committed by Bob Liu
remap_pfn_range() is meant to map the physical address pfn<<PAGE_SHIFT at
the given user address.

For nommu arches it was implemented as vma->vm_start = pfn << PAGE_SHIFT,
which is wrong according to the original meaning of this function, and a
driver developer calling remap_pfn_range() with correct parameters would
get an unexpected result because vm_start was changed.  It should behave
as addr = pfn << PAGE_SHIFT, which is meaningless on a nommu arch, so
this patch makes it simply validate the arguments and return (see the
sketch below).

The parameter name and the setting of vma->vm_flags are also fixed.
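
A sketch of the fixed nommu implementation (the exact vm_flags set is an
assumption; the essential points are the renamed pfn parameter, the
sanity check, and leaving vm_start untouched):

	int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
			unsigned long pfn, unsigned long size, pgprot_t prot)
	{
		/* the vma must already sit at the requested physical range */
		if (addr != (pfn << PAGE_SHIFT))
			return -EINVAL;

		vma->vm_flags |= VM_IO | VM_PFNMAP;	/* assumed flag set */
		return 0;
	}
	EXPORT_SYMBOL(remap_pfn_range);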
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• memcg: fix numa scan information update to be triggered by memory event · 453a9bf3
  Committed by KAMEZAWA Hiroyuki
commit 889976db ("memcg: reclaim memory from nodes in round-robin
order") added a NUMA node round-robin for memcg, but the information is
only updated once per 10 seconds.

This patch changes the update trigger from jiffies to memcg's event
count.  After this patch, the NUMA scan information is updated when we
have seen 1024 pagein/pageout events under a memcg.
      
      [akpm@linux-foundation.org: attempt to repair code layout]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• memcg: fix reclaimable lru check in memcg · 4d0c066d
  Committed by KAMEZAWA Hiroyuki
Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is
used for checking whether the memcg contains reclaimable pages or not.
If it has no pages, the routine skips it.

But mem_cgroup_local_usage() includes unevictable pages and cannot
handle the "noswap" condition correctly, so it doesn't work on a
swapless system.

This patch adds test_mem_cgroup_reclaimable() and replaces
mem_cgroup_local_usage().  test_mem_cgroup_reclaimable() checks the LRU
counters and returns the correct answer to the caller.  The new
function takes a "noswap" argument and can look at only the file LRU if
necessary.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix kerneldoc layout]
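
In outline the new check looks like this (a sketch assuming per-node
LRU statistics; helper and mask names follow the memcg code of this era
but should be treated as illustrative):

	static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *mem,
			int nid, bool noswap)
	{
		/* file pages are reclaimable regardless of swap */
		if (mem_cgroup_node_nr_lru_pages(mem, nid, LRU_ALL_FILE))
			return true;
		/* without swap, anon pages cannot be reclaimed */
		if (noswap || !total_swap_pages)
			return false;
		if (mem_cgroup_node_nr_lru_pages(mem, nid, LRU_ALL_ANON))
			return true;
		return false;
	}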
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: __tlb_remove_page() check the correct batch · 0b43c3aa
  Committed by Shaohua Li
      __tlb_remove_page() switches to a new batch page, but still checks space
      in the old batch.  This check always fails, and causes a forced tlb flush.
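
The fix is a single re-read of the active batch after switching (a
sketch based on the 3.0-era mmu_gather code):

	int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
	{
		struct mmu_gather_batch *batch = tlb->active;

		batch->pages[batch->nr++] = page;
		if (batch->nr == batch->max) {
			if (!tlb_next_batch(tlb))
				return 0;	/* no memory: flush now */
			batch = tlb->active;	/* the fix: use the new batch */
		}
		VM_BUG_ON(batch->nr > batch->max);

		/* returning 0 forces a TLB flush; report remaining space */
		return batch->max - batch->nr;
	}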
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully · 215ddd66
  Committed by Mel Gorman
      During allocator-intensive workloads, kswapd will be woken frequently
      causing free memory to oscillate between the high and min watermark.  This
      is expected behaviour.  Unfortunately, if the highest zone is small, a
      problem occurs.
      
      When balance_pgdat() returns, it may be at a lower classzone_idx than it
      started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which, when
there is no other activity, will be MAX_NR_ZONES-1.  It interprets this
as having been woken up while reclaiming, skips scheduling and reclaims
again.  As
      there is no useful reclaim work to do, it enters into a loop of shrinking
      slab consuming loads of CPU until the highest zone becomes reclaimable for
      a long period of time.
      
      There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) If the highest
      zone was marked unreclaimable but balance_pgdat() returns immediately at
      DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
      for sleeping.
      
      This patch does two things that are related.  If the end_zone is
      unreclaimable, this information is communicated back.  Second, if the
      classzone or order was reduced due to failing to reclaim, new information
      is not read from pgdat and instead an attempt is made to go to sleep.  Due
      to this, it is also necessary that pgdat->classzone_idx be initialised
      each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
      wakeups.
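
The main loop change can be sketched as follows (illustrative, written
from the changelog; variable names are assumptions about the kswapd()
of this era):

	/* in kswapd(): only pick up fresh wakeup data after success */
	if (balanced_classzone_idx >= new_classzone_idx &&
	    balanced_order == new_order) {
		new_order = pgdat->kswapd_max_order;
		new_classzone_idx = pgdat->classzone_idx;
		pgdat->kswapd_max_order = 0;
		/* reinitialise so a re-read is not taken as a wakeup */
		pgdat->classzone_idx = pgdat->nr_zones - 1;
	}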
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: vmscan: evaluate the watermarks against the correct classzone · da175d06
  Committed by Mel Gorman
      When deciding if kswapd is sleeping prematurely, the classzone is taken
      into account but this is different to what balance_pgdat() and the
      allocator are doing.  Specifically, the DMA zone will be checked based on
      the classzone used when waking kswapd which could be for a GFP_KERNEL or
GFP_HIGHMEM request.  The lowmem reserve limit kicks in, the watermark
is not met, and kswapd thinks it is sleeping prematurely, so it stays
awake in error.
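
The fix is to evaluate the watermark against the classzone kswapd was
actually woken for rather than classzone 0 (a sketch of the check in
sleeping_prematurely()):

	/* previously the classzone argument here was hardcoded to 0 */
	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
					classzone_idx, 0))
		all_zones_ok = false;	/* kswapd should stay awake */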
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: vmscan: do not apply pressure to slab if we are not applying pressure to zone · d7868dae
  Committed by Mel Gorman
      During allocator-intensive workloads, kswapd will be woken frequently
      causing free memory to oscillate between the high and min watermark.  This
      is expected behaviour.
      
      When kswapd applies pressure to zones during node balancing, it checks if
      the zone is above a high+balance_gap threshold.  If it is, it does not
      apply pressure but it unconditionally shrinks slab on a global basis which
is excessive.  In the event kswapd is being kept awake by a small,
unreclaimable highest zone, it skips zone shrinking but still calls
shrink_slab().
      
Once pressure has been applied, the check for the zone being
unreclaimable was made before checking whether all_unreclaimable should
be set.  Missing that a zone is unreclaimable can cause
has_under_min_watermark_zone to be set due to an unreclaimable zone,
preventing kswapd from backing off on congestion_wait().
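
In balance_pgdat(), the slab shrinking moves inside the watermark check
so it only happens when the zone itself is being shrunk (a sketch;
helper names follow the 3.0-era code):

	if (!zone_watermark_ok_safe(zone, order,
			high_wmark_pages(zone) + balance_gap,
			end_zone, 0)) {
		shrink_zone(priority, zone, &sc);

		reclaim_state->reclaimed_slab = 0;
		nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
		sc.nr_reclaimed += reclaim_state->reclaimed_slab;

		/* only now decide whether the zone is unreclaimable */
		if (nr_slab == 0 && !zone_reclaimable(zone))
			zone->all_unreclaimable = 1;
	}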
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: vmscan: correct check for kswapd sleeping in sleeping_prematurely · 08951e54
  Committed by Mel Gorman
      During allocator-intensive workloads, kswapd will be woken frequently
      causing free memory to oscillate between the high and min watermark.  This
      is expected behaviour.  Unfortunately, if the highest zone is small, a
      problem occurs.
      
This seems to happen most with recent Sandy Bridge laptops, but it's
probably a coincidence as some of these laptops just happen to have a
small Normal zone.  The reproduction case is almost always copying
large files, during which kswapd pegs at 100% CPU until the file is
deleted or the cache is dropped.
      
      The problem is mostly down to sleeping_prematurely() keeping kswapd awake
      when the highest zone is small and unreclaimable and compounded by the
      fact we shrink slabs even when not shrinking zones causing a lot of time
      to be spent in shrinkers and a lot of memory to be reclaimed.
      
      Patch 1 corrects sleeping_prematurely to check the zones matching
      	the classzone_idx instead of all zones.
      
      Patch 2 avoids shrinking slab when we are not shrinking a zone.
      
Patch 3 notes that sleeping_prematurely is checking lower zones against
	a high classzone, which is not what allocators or balance_pgdat()
	are doing, leading to an artificial belief that kswapd should
	still be awake.
      
      Patch 4 notes that when balance_pgdat() gives up on a high zone that the
      	decision is not communicated to sleeping_prematurely()
      
      This problem affects 2.6.38.8 for certain and is expected to affect 2.6.39
      and 3.0-rc4 as well.  If accepted, they need to go to -stable to be picked
      up by distros and this series is against 3.0-rc4.  I've cc'd people that
      reported similar problems recently to see if they still suffer from the
      problem and if this fixes it.
      
      This patch: correct the check for kswapd sleeping in sleeping_prematurely()
      
      During allocator-intensive workloads, kswapd will be woken frequently
      causing free memory to oscillate between the high and min watermark.  This
      is expected behaviour.
      
      A problem occurs if the highest zone is small.  balance_pgdat() only
      considers unreclaimable zones when priority is DEF_PRIORITY but
      sleeping_prematurely considers all zones.  It's possible for this sequence
      to occur
      
        1. kswapd wakes up and enters balance_pgdat()
        2. At DEF_PRIORITY, marks highest zone unreclaimable
        3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
        4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
              highest zone, clearing all_unreclaimable. Highest zone
              is still unbalanced
        5. kswapd returns and calls sleeping_prematurely
        6. sleeping_prematurely looks at *all* zones, not just the ones
           being considered by balance_pgdat. The highest small zone
           has all_unreclaimable cleared but the zone is not
           balanced. all_zones_ok is false so kswapd stays awake
      
      This patch corrects the behaviour of sleeping_prematurely to check the
      zones balance_pgdat() checked.
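
Sketched against the 3.0-era sleeping_prematurely() (illustrative;
pgdat_balanced() is the same condition balance_pgdat() itself uses for
high-order wakeups):

	/* only the zones balance_pgdat() considered: 0..classzone_idx */
	for (i = 0; i <= classzone_idx; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!populated_zone(zone))
			continue;
		/* ... watermark checks accumulate all_zones_ok
		 * and the balanced page count ... */
	}

	/* high-order wakeups sleep on the balance_pgdat() condition;
	 * order-0 still requires every checked zone to be balanced */
	if (order)
		return !pgdat_balanced(pgdat, balanced, classzone_idx);
	else
		return !all_zones_ok;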
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4. 28 Jun 2011, 7 commits
• memcg: fix direct softlimit reclaim to be called in limit path · ac34a1a3
  Committed by KAMEZAWA Hiroyuki
      Commit d149e3b2 ("memcg: add the soft_limit reclaim in global direct
      reclaim") adds a softlimit hook to shrink_zones().  By this, soft limit
      is called as
      
         try_to_free_pages()
             do_try_to_free_pages()
                 shrink_zones()
                     mem_cgroup_soft_limit_reclaim()
      
Direct reclaim is now aware of the memcg softlimit hint.
      
But the memory cgroup's "limit" path can also call the softlimit
shrinker:
      
         try_to_free_mem_cgroup_pages()
             do_try_to_free_pages()
                 shrink_zones()
                     mem_cgroup_soft_limit_reclaim()
      
This will cause a global reclaim when a memcg hits its limit.

This is a bug: soft_limit_reclaim() should only be called when
scanning_global_lru(sc) == true.
      
The commit also adds a variable "total_scanned" for counting
softlimit-scanned pages, but it is not really a "total".  This patch
removes the variable and updates sc->nr_scanned instead.  This affects
shrink_slab()'s scan condition, but since the global LRU is scanned by
softlimit reclaim, I think this change makes sense.
      
      TODO: avoid too much scanning of a zone when softlimit did enough work.
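
The hook becomes conditional on global reclaim, roughly (a sketch of
shrink_zones(); the soft limit scan count now feeds sc->nr_scanned as
described above):

	if (scanning_global_lru(sc)) {
		/* the soft limit hint only applies to global reclaim */
		nr_soft_scanned = 0;
		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
					sc->order, sc->gfp_mask,
					&nr_soft_scanned);
		sc->nr_reclaimed += nr_soft_reclaimed;
		sc->nr_scanned += nr_soft_scanned;
	}
	shrink_zone(priority, zone, sc);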
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Ying Han <yinghan@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: fix assertion mapping->nrpages == 0 in end_writeback() · 08142579
  Committed by Jan Kara
      Under heavy memory and filesystem load, users observe the assertion
      mapping->nrpages == 0 in end_writeback() trigger.  This can be caused by
      page reclaim reclaiming the last page from a mapping in the following
      race:
      
      	CPU0				CPU1
        ...
        shrink_page_list()
          __remove_mapping()
            __delete_from_page_cache()
              radix_tree_delete()
      					evict_inode()
      					  truncate_inode_pages()
      					    truncate_inode_pages_range()
      					      pagevec_lookup() - finds nothing
      					  end_writeback()
      					    mapping->nrpages != 0 -> BUG
              page->mapping = NULL
              mapping->nrpages--
      
      Fix the problem by doing a reliable check of mapping->nrpages under
      mapping->tree_lock in end_writeback().
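
The fix cycles the mapping's tree_lock so the assertion cannot race
with __delete_from_page_cache() (a sketch of the 3.0-era
end_writeback()):

	void end_writeback(struct inode *inode)
	{
		/*
		 * Reclaim may still be removing the last page under
		 * tree_lock, so take it once to make nrpages stable.
		 */
		spin_lock_irq(&inode->i_data.tree_lock);
		BUG_ON(inode->i_data.nrpages);
		spin_unlock_irq(&inode->i_data.tree_lock);

		BUG_ON(!list_empty(&inode->i_data.private_list));
		/* ... */
	}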
      
      Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out
      by Miklos Szeredi <mszeredi@suse.de>.
      
      Cc: Jay <jinshan.xiong@whamcloud.com>
      Cc: Miklos Szeredi <mszeredi@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/memory-failure.c: fix spinlock vs mutex order · 9b679320
  Committed by Peter Zijlstra
      We cannot take a mutex while holding a spinlock, so flip the order and
      fix the locking documentation.
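
The general pattern (illustrative, not the actual memory-failure code):
a mutex may sleep, so it must be taken before, never inside, a
spinlock-protected section:

	/* broken: mutex_lock() can sleep while the spinlock is held */
	spin_lock(&lock);
	mutex_lock(&mtx);

	/* fixed: flip the order */
	mutex_lock(&mtx);
	spin_lock(&lock);
	/* ... critical section ... */
	spin_unlock(&lock);
	mutex_unlock(&mtx);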
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• tmpfs: add shmem_read_mapping_page_gfp · d9d90e5e
  Committed by Hugh Dickins
      Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
      is unsuited to tmpfs, because it inserts a page into pagecache before
      calling the filesystem's ->readpage: tmpfs may have pages in swapcache
      which only it knows how to locate and switch to filecache.
      
At present tmpfs provides a ->readpage method, and copes with this by
copying pages; but soon we can simplify it by removing its ->readpage.
Provide shmem_read_mapping_page_gfp() now, ready for that transition.

Export shmem_read_mapping_page_gfp() and add it to the list in
shmem_fs.h, with shmem_read_mapping_page() inline for the common
mapping_gfp case.
      
      (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
      read_mapping_page functions use the mapping's ->readpage, and the
      read_cache_page functions use the supplied filler, so I think
      read_cache_page_gfp was slightly misnamed.)
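
The common-case wrapper is tiny (a sketch of the inline added to
shmem_fs.h):

	static inline struct page *shmem_read_mapping_page(
				struct address_space *mapping, pgoff_t index)
	{
		/* use the mapping's own gfp mask, the common case */
		return shmem_read_mapping_page_gfp(mapping, index,
						mapping_gfp_mask(mapping));
	}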
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• tmpfs: take control of its truncate_range · 94c1e62d
  Committed by Hugh Dickins
      2.6.35's new truncate convention gave tmpfs the opportunity to control
      its file truncation, no longer enforced from outside by vmtruncate().
      We shall want to build upon that, to handle pagecache and swap together.
      
      Slightly redefine the ->truncate_range interface: let it now be called
      between the unmap_mapping_range()s, with the filesystem responsible for
      doing the truncate_inode_pages_range() from it - just as the filesystem
      is nowadays responsible for doing that from its ->setattr.
      
      Let's rename shmem_notify_change() to shmem_setattr().  Instead of
      calling the generic truncate_setsize(), bring that code in so we can
      call shmem_truncate_range() - which will later be updated to perform its
      own variant of truncate_inode_pages_range().
      
      Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
      now that the COW's unmap_mapping_range() comes after ->truncate_range,
      there is no need to call it a third time.
      
      Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
      that i915_gem_object_truncate() can call it explicitly in future; get
      this patch in first, then update drm/i915 once this is available (until
      then, i915 will just be doing the truncate_inode_pages() twice).
      
      Though introduced five years ago, no other filesystem is implementing
      ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
      expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
      whereupon ->truncate_range can be removed from inode_operations -
      shmem_truncate_range() will help i915 across that transition too.
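
The redefined calling convention can be sketched as follows
(illustrative; the point is that truncate_inode_pages_range() now
happens inside the filesystem's ->truncate_range, bracketed by the
caller's unmap_mapping_range() calls):

	/* madvise(,,MADV_REMOVE) path, e.g. in vmtruncate_range() */
	mutex_lock(&inode->i_mutex);
	unmap_mapping_range(mapping, offset, end - offset, 1);
	inode->i_op->truncate_range(inode, offset, end);
	/* unmap again to catch pages COWed in the window */
	unmap_mapping_range(mapping, offset, end - offset, 1);
	mutex_unlock(&inode->i_mutex);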
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: move shmem prototypes to shmem_fs.h · 072441e2
  Committed by Hugh Dickins
      Before adding any more global entry points into shmem.c, gather such
      prototypes into shmem_fs.h.  Remove mm's own declarations from swap.h,
      but for now leave the ones in mm.h: because shmem_file_setup() and
      shmem_zero_setup() are called from various places, and we should not
      force other subsystems to update immediately.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: move vmtruncate_range to truncate.c · 5b8ba101
  Committed by Hugh Dickins
      You would expect to find vmtruncate_range() next to vmtruncate() in
      mm/truncate.c: move it there.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5. 23 Jun 2011, 2 commits
6. 18 Jun 2011, 3 commits
• mm: avoid anon_vma_chain allocation under anon_vma lock · dd34739c
  Committed by Linus Torvalds
      Hugh Dickins points out that lockdep (correctly) spots a potential
      deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
      of anon_vma_chain while doing anon_vma_clone().  The problem is that
      page reclaim will want to take the anon_vma lock of any anonymous pages
      that it will try to reclaim.
      
      So re-organize the code in anon_vma_clone() slightly: first do just a
      GFP_NOWAIT allocation, which will usually work fine.  But if that fails,
      let's just drop the lock and re-do the allocation, now with GFP_KERNEL.
      
      End result: not only do we avoid the locking problem, this also ends up
      getting better concurrency in case the allocation does need to block.
      Tim Chen reports that with all these anon_vma locking tweaks, we're now
      almost back up to the spinlock performance.
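
The reorganised allocation looks roughly like this (a sketch of the
anon_vma_clone() loop; anon_vma_chain_alloc() is a kmem_cache_alloc()
wrapper taking the gfp flags):

	avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
	if (unlikely(!avc)) {
		/* drop the anon_vma lock for a blocking allocation */
		unlock_anon_vma_root(root);
		root = NULL;
		avc = anon_vma_chain_alloc(GFP_KERNEL);
		if (!avc)
			goto enomem_failure;
	}
	root = lock_anon_vma_root(root, pavc->anon_vma);
	anon_vma_chain_link(dst, avc, pavc->anon_vma);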
Reported-and-tested-by: Hugh Dickins <hughd@google.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: avoid repeated anon_vma lock/unlock sequences in unlink_anon_vmas() · eee2acba
  Committed by Peter Zijlstra
      This matches the anon_vma_clone() case, and uses the same lock helper
      functions.  Because of the need to potentially release the anon_vma's,
      it's a bit more complex, though.
      
      We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets
      the anon_vma lock (with the helper function that only takes the lock
      once for the whole loop), and removes any entries that don't need any
      more processing.
      
      The second phase just traverses the remaining list entries (without
      holding the anon_vma lock), and does any actual freeing of the
      anon_vma's that is required.
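
A sketch of the two phases (following the description; the list and
helper names are from the anon_vma code of this era):

	/* phase 1: unlink entries, root lock taken once for the loop */
	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
		struct anon_vma *anon_vma = avc->anon_vma;

		root = lock_anon_vma_root(root, anon_vma);
		list_del(&avc->same_anon_vma);
		/* entries whose anon_vma became empty stay for phase 2 */
		if (list_empty(&anon_vma->head))
			continue;
		list_del(&avc->same_vma);
		anon_vma_chain_free(avc);
	}
	unlock_anon_vma_root(root);

	/* phase 2: free the leftovers without holding the lock */
	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
		put_anon_vma(avc->anon_vma);
		list_del(&avc->same_vma);
		anon_vma_chain_free(avc);
	}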
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Hugh Dickins <hughd@google.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: avoid repeated anon_vma lock/unlock sequences in anon_vma_clone() · bb4aa396
  Committed by Linus Torvalds
      In anon_vma_clone() we traverse the vma->anon_vma_chain of the source
      vma, locking the anon_vma for each entry.
      
      But they are all going to have the same root entry, which means that
      we're locking and unlocking the same lock over and over again.  Which is
      expensive in locked operations, but can get _really_ expensive when that
      root entry sees any kind of lock contention.
      
      In fact, Tim Chen reports a big performance regression due to this: when
      we switched to use a mutex instead of a spinlock, the contention case
      gets much worse.
      
      So to alleviate this all, this commit creates a small helper function
      (lock_anon_vma_root()) that can be used to take the lock just once
      rather than taking and releasing it over and over again.
      
      We still have the same "take the lock and release" it behavior in the
      exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
      since we're actually freeing the anon_vma entries as we go, and that
      will touch the lock too.
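
The helper pair (a sketch; in this era the anon_vma root lock is a
mutex):

	static inline struct anon_vma *
	lock_anon_vma_root(struct anon_vma *root, struct anon_vma *anon_vma)
	{
		struct anon_vma *new_root = anon_vma->root;

		if (new_root != root) {
			if (WARN_ON_ONCE(root))
				mutex_unlock(&root->mutex);
			root = new_root;
			mutex_lock(&root->mutex);
		}
		return root;	/* every entry in the chain shares it */
	}

	static inline void unlock_anon_vma_root(struct anon_vma *root)
	{
		if (root)
			mutex_unlock(&root->mutex);
	}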
Reported-and-tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7. 17 Jun 2011, 1 commit
8. 16 Jun 2011, 13 commits