1. 04 Aug, 2011 1 commit
    • radix_tree: exceptional entries and indices · 6328650b
      Committed by Hugh Dickins
      A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
      peculiar swap vector, instead keeping a file's swap entries in the same
      radix tree as its struct page pointers: thus saving memory, and
      simplifying its code and locking.
      
      This patch:
      
      The radix_tree is used by several subsystems for different purposes.  A
      major use is to store the struct page pointers of a file's pagecache for
      memory management.  But what if mm wanted to store something other than
      page pointers there too?
      
      The low bit of a radix_tree entry is already used to denote an indirect
      pointer, for internal use, and the unlikely radix_tree_deref_retry()
      case.
      
      Define the next bit as denoting an exceptional entry, and supply inline
      functions radix_tree_exception() to return non-0 in either unlikely
      case, and radix_tree_exceptional_entry() to return non-0 in the second
      case.
      
      If a subsystem already uses radix_tree with that bit set, no problem: it
      does not affect internal workings at all, but is defined for the
      convenience of those storing well-aligned pointers in the radix_tree.
      
      The radix_tree_gang_lookups have an implicit assumption that the caller
      can deduce the offset of each entry returned e.g.  by the page->index of
      a struct page.  But that may not be feasible for some kinds of item to
      be stored there.
      
      radix_tree_gang_lookup_slot() now allows for an optional indices argument,
      an output array in which to return those offsets.  The same could be added
      to other radix_tree_gang_lookups, but for now keep it to the only one
      for which we need it.
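The two tag bits described in this message can be illustrated with a small userspace sketch. It assumes entries are pointers aligned to at least 4 bytes, so the two low bits are free. All names here (`ENTRY_INDIRECT`, `ENTRY_EXCEPTIONAL`, `entry_from_value()` and so on) are illustrative assumptions, not the kernel's identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative tag bits; the names are assumptions, not the kernel's. */
#define ENTRY_INDIRECT    ((uintptr_t)1) /* low bit: indirect pointer / retry */
#define ENTRY_EXCEPTIONAL ((uintptr_t)2) /* next bit: exceptional entry */
#define ENTRY_VALUE_SHIFT 2              /* payload sits above both tag bits */

/* Non-zero in either unlikely case: indirect pointer or exceptional entry. */
static inline int entry_is_exception(const void *entry)
{
    return ((uintptr_t)entry & (ENTRY_INDIRECT | ENTRY_EXCEPTIONAL)) != 0;
}

/* Non-zero only in the second case: an exceptional entry. */
static inline int entry_is_exceptional(const void *entry)
{
    return ((uintptr_t)entry & ENTRY_EXCEPTIONAL) != 0;
}

/* Pack a small integer (for example a swap-entry-like value) into a slot. */
static inline void *entry_from_value(uintptr_t value)
{
    /* The value must survive the shift without losing its high bits. */
    assert((value << ENTRY_VALUE_SHIFT) >> ENTRY_VALUE_SHIFT == value);
    return (void *)((value << ENTRY_VALUE_SHIFT) | ENTRY_EXCEPTIONAL);
}

/* Recover the integer stored by entry_from_value(). */
static inline uintptr_t entry_to_value(const void *entry)
{
    assert(entry_is_exceptional(entry));
    return (uintptr_t)entry >> ENTRY_VALUE_SHIFT;
}
```

A caller walking the tree would test `entry_is_exception()` first and only then distinguish the two cases, so the common path (an ordinary aligned pointer) costs a single mask and compare.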
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 26 Jul, 2011 2 commits
    • mm: consistent truncate and invalidate loops · b85e0eff
      Committed by Hugh Dickins
      Make the pagevec_lookup loops in truncate_inode_pages_range(),
      invalidate_mapping_pages() and invalidate_inode_pages2_range() more
      consistent with each other.
      
      They were relying upon page->index of an unlocked page, but apologizing
      for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
      index handling.
      
      invalidate_inode_pages2_range() had special handling for a wrapped
      page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
      near there, and a corrupt page->index in the radix_tree could cause more
      trouble than that would catch.  Remove that wrapped handling.
      
      invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
      when near the end of the range: copy that into the other two, although
      it's less useful than you might think (it limits the use of the buffer,
      rather than the indices looked up).
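The min() clamp mentioned in the last paragraph can be sketched as follows. This is a minimal illustration, assuming an inclusive `end` and a batch capacity of 14; the helper name `lookup_count()` and the capacity value are assumptions made for the example.

```c
#include <assert.h>

typedef unsigned long pgoff_t;

/* Assumed batch capacity, chosen for illustration only. */
#define PAGEVEC_SIZE 14UL

/*
 * Number of entries to request from one lookup when scanning the range
 * [index, end], with end inclusive.  Written as
 * min(end - index, PAGEVEC_SIZE - 1) + 1 so that end == (pgoff_t)-1
 * cannot overflow.
 */
static inline pgoff_t lookup_count(pgoff_t index, pgoff_t end)
{
    assert(index <= end);
    pgoff_t span = end - index;
    pgoff_t cap = PAGEVEC_SIZE - 1;
    return (span < cap ? span : cap) + 1;
}
```

As the message notes, this only bounds how many results one lookup may return. It does not bound which indices are visited, so a caller would still have to compare each returned page's index against `end`.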
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cleanup descriptions of filler arg · 5e5358e7
      Committed by Hugh Dickins
      The often-NULL data arg to read_cache_page() and read_mapping_page()
      functions is misdescribed as "destination for read data": no, it's the
      first arg to the filler function, often struct file * to ->readpage().
      
      Satisfy checkpatch.pl on those filler prototypes, and tidy up the
      declarations in linux/pagemap.h.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 21 Jul, 2011 1 commit
    • fs: kill i_alloc_sem · bd5fe6c5
      Committed by Christoph Hellwig
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and its write side is always mirrored by
      real exclusion.  Its intended use is to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall away
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
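The counting scheme in the list above can be sketched in userspace with C11 atomics. This is a simplified illustration under stated assumptions: `struct fake_inode` and the helper names are hypothetical, and the sleep and wake-up mechanism is reduced to a boolean return value that tells the caller when a waiting truncate would be woken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct inode. */
struct fake_inode {
    atomic_int dio_count; /* direct I/O requests currently in flight */
};

static inline void fake_inode_init(struct fake_inode *inode)
{
    atomic_init(&inode->dio_count, 0);
}

/* Start of a direct I/O request; the caller holds the exclusion lock. */
static inline void dio_begin(struct fake_inode *inode)
{
    atomic_fetch_add(&inode->dio_count, 1);
}

/*
 * Completion of a direct I/O request, possibly in another context.
 * Returns true when this was the last request in flight, which is the
 * point at which a waiting truncate should be woken.
 */
static inline bool dio_done(struct fake_inode *inode)
{
    int old = atomic_fetch_sub(&inode->dio_count, 1);
    assert(old > 0);
    return old == 1;
}

/* Truncate may only proceed while no direct I/O is in flight. */
static inline bool truncate_may_proceed(struct fake_inode *inode)
{
    return atomic_load(&inode->dio_count) == 0;
}
```

Because new requests can only start under the same lock that truncate holds, the count can only fall while truncate waits, so a single wake-up at zero is sufficient in this model.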
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 08 Jun, 2011 1 commit
    • writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Committed by Christoph Hellwig
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contentions to 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      [hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old]
      [akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  5. 04 Jun, 2011 1 commit
    • more conservative S_NOSEC handling · 9e1f1de0
      Committed by Al Viro
      Caching "we have already removed suid/caps" was overenthusiastic as merged.
      On network filesystems we might have had suid/caps set on another client,
      silently picked by this client on revalidate, all of that *without* clearing
      the S_NOSEC flag.
      
      AFAICS, the only reasonably sane way to deal with that is
      	* new superblock flag; unless set, S_NOSEC is not going to be set.
      	* local block filesystems set it in their ->mount() (more accurately,
      mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
      local block ones clear it)
      	* if any network filesystem (or a cluster one) wants to use S_NOSEC,
      it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
      inode attribute changes are picked from other clients.
      
      It's not an earth-shattering hole (anybody that can set suid on another client
      will almost certainly be able to write to the file before doing that anyway),
      but it's a bug that needs fixing.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. 29 May, 2011 1 commit
    • Cache xattr security drop check for write v2 · 69b45732
      Committed by Andi Kleen
      Some recent benchmarking on btrfs showed that a major scaling bottleneck
      on large systems on btrfs is currently the xattr lookup on every write.
      
      Why an xattr lookup on every write, I hear you ask?
      
      write wants to drop suid and security-related xattrs that could set
      capabilities for executables.  To do that it currently looks up
      security.capability on EVERY write (even for non-executables) to decide
      whether to drop it or not.
      
      In btrfs this causes an additional tree walk, hitting some per-filesystem
      locks and scaling quite badly. In a simple read workload on an 8S
      system I saw over 90% CPU time in spinlocks related to that.
      
      Chris Mason tells me this is also a problem in ext4, where it hits
      the global mbcache lock.
      
      This patch adds a simple per-inode flag to avoid this problem.  We only
      do the lookup once per file and then, if there is no xattr, cache
      the decision. All xattr changes clear the flag.
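The caching idea can be sketched as follows. This is a userspace illustration, not the patch itself: `struct fake_inode`, `FLAG_NOSEC` and the helper names are assumptions, and the expensive xattr lookup is replaced by a stub that counts how often it is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: "nothing security-related to drop on write". */
#define FLAG_NOSEC 0x1u

/* Hypothetical stand-in for the relevant inode state. */
struct fake_inode {
    unsigned int flags;
    bool has_security_xattr; /* stands in for the stored xattr */
    bool is_suid;
    unsigned int lookups;    /* counts expensive lookups, for illustration */
};

/* Stand-in for the expensive xattr lookup (a tree walk on btrfs). */
static inline bool expensive_xattr_lookup(struct fake_inode *inode)
{
    inode->lookups++;
    return inode->has_security_xattr;
}

/* Returns true if the write path must strip suid/security state. */
static inline bool write_needs_security_drop(struct fake_inode *inode)
{
    assert(inode != NULL);
    if (inode->flags & FLAG_NOSEC)
        return false;           /* cached: nothing to drop */
    if (inode->is_suid || expensive_xattr_lookup(inode))
        return true;            /* positive results are not cached */
    inode->flags |= FLAG_NOSEC; /* cache the negative result */
    return false;
}

/* Any xattr change must invalidate the cached decision. */
static inline void set_security_xattr(struct fake_inode *inode, bool present)
{
    inode->has_security_xattr = present;
    inode->flags &= ~FLAG_NOSEC;
}
```

Only the negative answer is cached here, which matches the common case the message describes: ordinary files with no suid bit and no security xattr pay for the lookup once.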
      
      I also used the same flag to avoid the suid check, although
      that one is pretty cheap.
      
      A file system can also set this flag when it creates the inode,
      if it has a cheap way to do so.  This is done for some common file systems
      in follow-on patches.
      
      With this patch a major part of the lock contention disappears
      for btrfs. Some testing on smaller systems didn't show significant
      performance changes, but at least it helps the larger systems
      and is generally more efficient.
      
      v2: Rename is_sgid. Add file system helper.
      Cc: chris.mason@oracle.com
      Cc: josef@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Cc: agruen@linbit.com
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  7. 28 May, 2011 1 commit
  8. 27 May, 2011 2 commits
    • memcg: add the pagefault count into memcg stats · 456f998e
      Committed by Ying Han
      Add two new stats to the per-memcg memory.stat, which track the number of
      page faults and the number of major page faults.
      
        "pgfault"
        "pgmajfault"
      
      They differ from the "pgpgin"/"pgpgout" stats, which count the number of
      pages charged to/uncharged from the cgroup and say nothing about reading
      or writing pages to disk.
      
      Tracking the two stats is valuable for measuring both an application's
      performance and the efficiency of the kernel page reclaim path.
      Counting page faults per process is useful, but we also need the aggregated
      value, since in memcg processes are monitored and controlled on a
      per-cgroup basis.
      
      Functional test: check the total number of pgfault/pgmajfault of all
      memcgs and compare with global vmstat value:
      
        $ cat /proc/vmstat | grep fault
        pgfault 1070751
        pgmajfault 553
      
        $ cat /dev/cgroup/memory.stat | grep fault
        pgfault 1071138
        pgmajfault 553
        total_pgfault 1071142
        total_pgmajfault 553
      
        $ cat /dev/cgroup/A/memory.stat | grep fault
        pgfault 199
        pgmajfault 0
        total_pgfault 199
        total_pgmajfault 0
      
      Performance test: run the page fault test (pft) with 16 threads, faulting in
      15G of anon pages in a 16G container.  No regression was observed in
      "flt/cpu/s".
      
      Sample output from pft:
      
        TAG pft:anon-sys-default:
          Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
          15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260
      
        +-------------------------------------------------------------------------+
            N           Min           Max        Median           Avg        Stddev
        x  10     16682.962     17344.027     16913.524     16928.812      166.5362
        +  10     16695.568     16923.896     16820.604     16824.652     84.816568
        No difference proven at 95.0% confidence
      
      [akpm@linux-foundation.org: fix build]
      [hughd@google.com: shmem fix]
      Signed-off-by: Ying Han <yinghan@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fs: add hooks to support cleancache · c515e1fd
      Committed by Dan Magenheimer
      This fourth patch of eight in this cleancache series provides the
      core hooks in VFS for: initializing cleancache per filesystem;
      capturing clean pages reclaimed by page cache; attempting to get
      pages from cleancache before filesystem read; and ensuring coherency
      between pagecache, disk, and cleancache.  Note that the placement
      of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
      change was required due to a patchset in 2.6.39.
      
      All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
      a check of a boolean global if CONFIG_CLEANCACHE is set but no
      cleancache "backend" has claimed cleancache_ops.
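The "no-op unless configured, cheap boolean check unless a backend is registered" pattern can be sketched like this. It is an illustration only: the hook and backend names are hypothetical, and the real interface is the one described in the documentation file the message points to. Building with `-DCONFIG_CLEANCACHE` compiles the enabled variant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page. */
struct fake_page {
    unsigned long index;
};

#ifdef CONFIG_CLEANCACHE

/* Set once a backend registers itself; false means "no backend yet". */
static bool fake_cleancache_enabled;

/* Hypothetical backend call: 0 on a hit, -1 on a miss. */
static inline int fake_backend_get_page(struct fake_page *page)
{
    (void)page;
    return -1; /* this toy backend never holds any pages */
}

/* Hook: one boolean test on the fast path when no backend is present. */
static inline int fake_hook_get_page(struct fake_page *page)
{
    assert(page != NULL);
    if (!fake_cleancache_enabled)
        return -1;
    return fake_backend_get_page(page);
}

#else

/* Hook compiled out: callers see a constant miss and no extra code. */
static inline int fake_hook_get_page(struct fake_page *page)
{
    (void)page;
    return -1;
}

#endif
```

Call sites stay identical in both configurations, which is what lets such hooks be placed in hot paths without cost to kernels that do not enable the feature.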
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt
      
      [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
  9. 25 May, 2011 6 commits
  10. 25 Mar, 2011 2 commits
    • fs: move i_wb_list out from under inode_lock · a66979ab
      Committed by Dave Chinner
      Protect the inode writeback list with a new global lock
      inode_wb_list_lock and use it to protect the list manipulations and
      traversals. This lock replaces the inode_lock as the inodes on the
      list can be validity checked while holding the inode->i_lock and
      hence the inode_lock is no longer needed to protect the list.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: protect inode->i_state with inode->i_lock · 250df6ed
      Committed by Dave Chinner
      Protect inode state transitions and validity checks with the
      inode->i_lock. This enables us to make inode state transitions
      independently of the inode_lock and is the first step to peeling
      away the inode_lock from the code.
      
      This requires that __iget() is done atomically with i_state checks
      during list traversals so that we don't race with another thread
      marking the inode I_FREEING between the state check and grabbing the
      reference.
      
      Also remove the unlock_new_inode() memory barrier optimisation
      required to avoid taking the inode_lock when clearing I_NEW.
      Simplify the code by simply taking the inode->i_lock around the
      state change and wakeup. Because the wakeup is no longer tricky,
      remove the wake_up_inode() function and open code the wakeup where
      necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  11. 23 Mar, 2011 7 commits
  12. 10 Mar, 2011 2 commits
  13. 14 Jan, 2011 3 commits
    • mm: remove likely() from grab_cache_page_write_begin() · c585a267
      Committed by Steven Rostedt
      Running the annotated branch profiler on a box doing average work
      (firefox, evolution, xchat, distcc farm), the likely() used in
      grab_cache_page_write_begin() was incorrect most of the time:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
       1924262 71332401  97 grab_cache_page_write_begin    filemap.c           2206
      
      Adding a trace_printk() and running the function tracer limited to
      just this function I can see:
      
              gconfd-2-2696  [000]  4467.268935: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=7
              gconfd-2-2696  [000]  4467.268946: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268947: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=8
              gconfd-2-2696  [000]  4467.268959: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268960: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=9
              gconfd-2-2696  [000]  4467.268972: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268973: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=10
              gconfd-2-2696  [000]  4467.268991: grab_cache_page_write_begin <-ext3_write_begin
              gconfd-2-2696  [000]  4467.268992: grab_cache_page_write_begin: page=          (null) mapping=ffff8800676a9460 index=11
              gconfd-2-2696  [000]  4467.269005: grab_cache_page_write_begin <-ext3_write_begin
      
      This shows that for many calls from ext3_write_begin, the page returned by
      find_lock_page() is NULL.
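For readers unfamiliar with branch hints, here is a minimal sketch of what removing a likely() annotation looks like. The macros follow the usual GCC/Clang `__builtin_expect` idiom; the page cache itself is reduced to a linear search, and `lookup_page()` and `grab_page()` are hypothetical helpers written for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hints as commonly defined for GCC/Clang; they only guide code
 * layout and never change the result of the condition. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for a cached page. */
struct page {
    unsigned long index;
};

/* Hypothetical lookup: returns the cached page, or NULL on a miss. */
static inline struct page *lookup_page(struct page *cache, size_t n,
                                       unsigned long index)
{
    for (size_t i = 0; i < n; i++)
        if (cache[i].index == index)
            return &cache[i];
    return NULL;
}

/*
 * With the hint, this read "if (likely(page)) return page;".  When most
 * writes extend a file, the lookup usually misses and the hint is wrong,
 * so the test is left unannotated.
 */
static inline struct page *grab_page(struct page *cache, size_t n,
                                     unsigned long index, struct page *fresh)
{
    assert(fresh != NULL);
    struct page *page = lookup_page(cache, n, index);
    if (page)
        return page;
    fresh->index = index; /* miss: hand back a newly prepared page */
    return fresh;
}
```

Dropping the hint leaves the behaviour unchanged and lets the compiler and the hardware branch predictor decide, which is the safer default when profiling shows the hint being wrong most of the time.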
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: clear PageError bit in msync & fsync · 212260aa
      Committed by Rik van Riel
      Temporary IO failures, e.g. due to loss of both multipath paths, can
      permanently leave the PageError bit set on a page, resulting in msync or
      fsync returning -EIO over and over again, even if IO is now getting to the
      disk correctly.
      
      We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in the
      filemap_fdatawait_range function.  Also clearing the PageError bit on the
      page allows subsequent msync or fsync calls on this file to return without
      an error, if the subsequent IO succeeds.
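The effect of clearing the error bit while waiting can be sketched as follows. This is a userspace model under stated assumptions: `struct fake_page`, `PG_ERROR_BIT` and both helpers are hypothetical, and the wait itself is omitted so that only the flag handling remains.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-page error flag. */
#define PG_ERROR_BIT 0x1u

/* Hypothetical stand-in for a page with a flags word. */
struct fake_page {
    unsigned int flags;
};

/* Report whether the error flag was set, clearing it in the same step. */
static inline int test_and_clear_error(struct fake_page *page)
{
    int was_set = (page->flags & PG_ERROR_BIT) != 0;
    page->flags &= ~PG_ERROR_BIT;
    return was_set;
}

/*
 * Model of the "wait for writeback" pass: the error is reported once as
 * -EIO, and no flag is left behind to fail later calls.
 */
static inline int wait_on_pages(struct fake_page *pages, size_t n)
{
    int ret = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (test_and_clear_error(&pages[i]))
            ret = -EIO;
    return ret;
}
```

In this model the first sync after a failed write returns -EIO and the next one returns 0, which is the behaviour the message asks for; as the message goes on to say, it does not bring back the data from the failed write.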
      
      Unfortunately data written out in the msync or fsync call that returned
      -EIO can still get lost, because the page dirty bit appears to not get
      restored on IO error.  However, the alternative could be potentially all
      of memory filling up with uncleanable dirty pages, hanging the system, so
      there is no nice choice here...
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Valerie Aurora <vaurora@redhat.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: find_get_pages_contig fixlet · 9cbb4cb2
      Committed by Nick Piggin
      Testing ->mapping and ->index without a ref is not stable as the page
      may have been reused at this point.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 07 Jan, 2011 1 commit
  15. 02 Dec, 2010 1 commit
    • Call the filesystem back whenever a page is removed from the page cache · 6072d13c
      Committed by Linus Torvalds
      NFS needs to be able to release objects that are stored in the page
      cache once the page itself is no longer visible from the page cache.
      
      This patch adds a callback to the address space operations that allows
      filesystems to perform page cleanups once the page has been removed
      from the page cache.
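The shape of such an optional callback can be sketched as below. This is an illustration with hypothetical types (`struct fake_aops`, `struct fake_mapping`, `struct fake_page`) and a counting callback added purely so the behaviour can be observed; it is not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

struct fake_page;

/* Hypothetical operations table; the callback is optional (may be NULL). */
struct fake_aops {
    void (*freepage)(struct fake_page *page);
};

/* Hypothetical stand-in for an address space. */
struct fake_mapping {
    const struct fake_aops *a_ops;
};

/* Hypothetical stand-in for a page; a non-NULL mapping means "in cache". */
struct fake_page {
    struct fake_mapping *mapping;
};

/* Example callback: counts invocations so the effect can be observed. */
static int freed_count;

static inline void count_freepage(struct fake_page *page)
{
    (void)page;
    freed_count++;
}

/* Remove a page from its cache, then give the filesystem a chance to
 * clean up whatever it attached to the page. */
static inline void remove_from_cache(struct fake_page *page)
{
    struct fake_mapping *mapping = page->mapping;

    assert(mapping != NULL);
    /* Fetch the callback first: the page forgets its mapping below. */
    void (*freepage)(struct fake_page *) = mapping->a_ops->freepage;

    page->mapping = NULL; /* the page is no longer visible in the cache */
    if (freepage)
        freepage(page);
}
```

The ordering is the point of the sketch: the callback runs only after the page has left the cache, so the filesystem can release its objects without racing against new lookups.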
      
      Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
      [trondmy: cover the cases of invalidate_inode_pages2() and
                truncate_inode_pages()]
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
  16. 12 Nov, 2010 2 commits
    • radix-tree: fix RCU bug · 27d20fdd
      Committed by Nick Piggin
      Salman Qazi describes the following radix-tree bug:
      
      In the following case, we can get a deadlock:
      
      0.  The radix tree contains two items, one has the index 0.
      1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
      2.  The reader acquires slot(s) for item(s) including the index 0 item.
      3.  The non-zero index item is deleted, and as a consequence the other item is
          moved to the root of the tree. The place where it used to be is queued for
          deletion after the readers finish.
      3b. The zero item is deleted, removing it from the direct slot, it remains in
          the rcu-delayed indirect node.
      4.  The reader looks at the index 0 slot, and finds that the page has 0 ref
          count
      5.  The reader looks at it again, hoping that the item will either be freed or
          the ref count will increase. This never happens, as the slot it is looking
          at will never be updated. Also, this slot can never be reclaimed because
          the reader is holding rcu_read_lock and is in an infinite loop.
      
      The fix is to turn the same "indirect" pointer case that requires a slot
      lookup retry into a general "retry the lookup" bit.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Reported-by: Salman Qazi <sqazi@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vfs: revalidate page->mapping in do_generic_file_read() · 8d056cb9
      Committed by Dave Hansen
      70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
      ran into a NULL dereference in here:
      
      	int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
      	                                        unsigned long from)
      	{
      ---->		struct inode *inode = page->mapping->host;
      
      It looks like page->mapping was the culprit.  (xmon trace is below).
      After closer examination, I realized that do_generic_file_read() does a
      find_get_page(), and eventually locks the page before calling
      block_is_partially_uptodate().  However, it doesn't revalidate the
      page->mapping after the page is locked.  So, there's a small window
      between the find_get_page() and ->is_partially_uptodate() where the page
      could get truncated and page->mapping cleared.
      
      We _have_ a reference, so it can't get reclaimed, but it certainly
      can be truncated.
      
      I think the correct thing is to check page->mapping after the
      trylock_page(), and jump out if it got truncated.  This patch has been
      running in the test environment for a month or so now, and we have not
      seen this bug pop up again.
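The "check the mapping again once the page is locked" step can be sketched as follows. This is a simplified userspace model: the types, the lock (a plain flag) and the three-way result are assumptions made for illustration, not the code of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page. */
struct fake_page {
    void *mapping; /* NULL once the page has been truncated */
    int locked;
};

/* Toy, non-blocking lock: returns 1 on success, 0 if already locked. */
static inline int fake_trylock_page(struct fake_page *page)
{
    if (page->locked)
        return 0;
    page->locked = 1;
    return 1;
}

static inline void fake_unlock_page(struct fake_page *page)
{
    assert(page->locked);
    page->locked = 0;
}

enum read_action {
    READ_USE_PAGE,     /* page is locked and still belongs to the file */
    READ_RETRY_LOOKUP, /* page was truncated: look it up again */
    READ_WAIT          /* someone else holds the lock */
};

/*
 * Holding a reference keeps the page from being freed, but not from being
 * truncated.  The mapping is only stable under the page lock, so it is
 * checked after the lock is taken and before it is dereferenced.
 */
static inline enum read_action check_page_after_lock(struct fake_page *page)
{
    if (!fake_trylock_page(page))
        return READ_WAIT;
    if (page->mapping == NULL) {
        fake_unlock_page(page);
        return READ_RETRY_LOOKUP;
    }
    return READ_USE_PAGE; /* returns with the page still locked */
}
```

On `READ_USE_PAGE` the caller may dereference `page->mapping` safely for as long as it keeps the lock, which closes the window described in the message.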
      
      xmon info:
      
        1f:mon> e
        cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
            pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
            lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
            sp: c0000002ae36f9f0
           msr: 8000000000009032
           dar: 0
         dsisr: 40000000
          current = 0xc000000378f99e30
          paca    = 0xc000000000f66300
            pid   = 21946, comm = bash
        1f:mon> r
        R00 = 0025c0500000006d   R16 = 0000000000000000
        R01 = c0000002ae36f9f0   R17 = c000000362cd3af0
        R02 = c000000000e8cd80   R18 = ffffffffffffffff
        R03 = c0000000031d0f88   R19 = 0000000000000001
        R04 = c0000002ae36fa68   R20 = c0000003bb97b8a0
        R05 = 0000000000000000   R21 = c0000002ae36fa68
        R06 = 0000000000000000   R22 = 0000000000000000
        R07 = 0000000000000001   R23 = c0000002ae36fbb0
        R08 = 0000000000000002   R24 = 0000000000000000
        R09 = 0000000000000000   R25 = c000000362cd3a80
        R10 = 0000000000000000   R26 = 0000000000000002
        R11 = c0000000001e7b60   R27 = 0000000000000000
        R12 = 0000000042000484   R28 = 0000000000000001
        R13 = c000000000f66300   R29 = c0000003bb97b9b8
        R14 = 0000000000000001   R30 = c000000000e28a08
        R15 = 000000000000ffff   R31 = c0000000031d0f88
        pc  = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
        lr  = c000000000142944 .generic_file_aio_read+0x1e4/0x770
        msr = 8000000000009032   cr  = 22000488
        ctr = c0000000001e7a60   xer = 0000000020000000   trap =  300
        dar = 0000000000000000   dsisr = 40000000
        1f:mon> t
        [link register   ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
        [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
        [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
        [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
        [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
        [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
        --- Exception: c00 (System Call) at 00000080a840bc54
        SP (fffca15df30) is in userspace
        1f:mon> di c0000000001e7a6c
        c0000000001e7a6c  e9290000      ld      r9,0(r9)
        c0000000001e7a70  418200c0      beq     c0000000001e7b30        # .block_is_partially_uptodate+0xd0/0x100
        c0000000001e7a74  e9440008      ld      r10,8(r4)
        c0000000001e7a78  78a80020      clrldi  r8,r5,32
        c0000000001e7a7c  3c000001      lis     r0,1
        c0000000001e7a80  812900a8      lwz     r9,168(r9)
        c0000000001e7a84  39600001      li      r11,1
        c0000000001e7a88  7c080050      subf    r0,r8,r0
        c0000000001e7a8c  7f805040      cmplw   cr7,r0,r10
        c0000000001e7a90  7d6b4830      slw     r11,r11,r9
        c0000000001e7a94  796b0020      clrldi  r11,r11,32
        c0000000001e7a98  419d00a8      bgt     cr7,c0000000001e7b40    # .block_is_partially_uptodate+0xe0/0x100
        c0000000001e7a9c  7fa55840      cmpld   cr7,r5,r11
        c0000000001e7aa0  7d004214      add     r8,r0,r8
        c0000000001e7aa4  79080020      clrldi  r8,r8,32
        c0000000001e7aa8  419c0078      blt     cr7,c0000000001e7b20    # .block_is_partially_uptodate+0xc0/0x100
      Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <arunabal@in.ibm.com>
      Cc: <sbest@us.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 03 Nov, 2010 1 commit
  18. 27 Oct, 2010 3 commits
  19. 10 Aug, 2010 1 commit
  20. 27 May, 2010 1 commit