1. 02 Oct, 2011 5 commits
    • btrfs: hooks for readahead · 4bb31e92
      Arne Jansen committed
      This adds the hooks needed for readahead. In the readpage_end_io_hook,
      the extent state is checked for the EXTENT_READAHEAD flag. The readahead
      hook is called only in this case, to keep the impact on non-readahead
      reads as low as possible.
      Additionally, a hook for a failed IO is added, otherwise readahead would
      wait indefinitely for the extent to finish.
      
      Changes for v2:
       - eliminate race condition
      Signed-off-by: Arne Jansen <sensille@gmx.net>
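The hook placement described in this commit can be modelled roughly as follows. This is a Python sketch and not the kernel code: `EXTENT_READAHEAD` is the flag named in the message, while `end_io`, `io_failed`, the `ra_hook` callback, and the bit value are hypothetical names chosen for illustration.

```python
EXTENT_READAHEAD = 1 << 0  # flag named in the commit; the bit value is made up


def end_io(flags, block, ra_hook):
    """Completion path: call the readahead hook only for flagged extents."""
    if flags & EXTENT_READAHEAD:
        ra_hook(block, error=False)


def io_failed(flags, block, ra_hook):
    """Failure path: still notify readahead so it does not wait forever."""
    if flags & EXTENT_READAHEAD:
        ra_hook(block, error=True)


def demo():
    """Return the hook calls seen for one plain read and two readahead reads."""
    calls = []

    def hook(block, error):
        calls.append((block, error))

    end_io(0, "plain", hook)                     # not readahead: hook is skipped
    end_io(EXTENT_READAHEAD, "ra-ok", hook)      # readahead read finished
    io_failed(EXTENT_READAHEAD, "ra-bad", hook)  # readahead read failed
    return calls
```

The plain read never reaches the hook, which is the "low impact on non-readahead" property the message asks for.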
    • btrfs: initial readahead code and prototypes · 7414a03f
      Arne Jansen committed
      This is the implementation for the generic read ahead framework.
      
      To trigger a readahead, btrfs_reada_add must be called. It will start
      a read ahead for the given range [start, end) on tree root. The returned
      handle can either be used to wait on the readahead to finish
      (btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
      
      The read ahead works as follows:
      On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
      reada_start_machine will then search for extents to prefetch and trigger
      some reads. When a read finishes for a node, all contained node/leaf
      pointers that lie in the given range will also be enqueued. The reads will
      be triggered in sequential order, thus giving a big win over a naive
      enumeration. It will also make use of multi-device layouts. Each disk
      will have its own read pointer and all disks will be utilized in parallel.
      Also, no two disks will read both sides of a mirror simultaneously, as this
      would waste seeking capacity. Instead both disks will read different parts
      of the filesystem.
      Any number of readaheads can be started in parallel. The read order will be
      determined globally, i.e. 2 parallel readaheads will normally finish faster
      than the 2 started one after another.
      
      Changes v2:
       - protect root->node by transaction instead of node_lock
       - fix missed branches:
          The readahead had a too simple check to determine if a branch from
          a node should be checked or not. It now also records the upper bound
          of each node to see if the requested RA range lies within.
       - use KERN_CONT to debug output, to avoid line breaks
       - defer reada_start_machine to worker to avoid deadlock
      
      Changes v3:
       - protect root->node by rcu
      
      Changes v5:
       - changed EIO-semantics of reada_tree_block_flagged
       - remove spin_lock from reada_control and make elems an atomic_t
       - remove unused read_total from reada_control
       - kill reada_key_cmp, use btrfs_comp_cpu_keys instead
       - use kref-style release functions where possible
       - return struct reada_control * instead of void * from btrfs_reada_add
      Signed-off-by: Arne Jansen <sensille@gmx.net>
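The traversal described above (read a node, then enqueue the child pointers that fall inside the requested range, always issuing the lowest pending disk address next) can be sketched in a few lines. This is a single-device Python model under stated assumptions, not the btrfs implementation: `Node`, `readahead`, and `demo_tree` are hypothetical names, and the per-device read pointers and mirror handling are left out.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    addr: int                 # on-disk address of the block (illustrative)
    lo: int                   # lowest key covered by this node
    hi: int                   # upper bound of the node's keys (exclusive)
    children: list = field(default_factory=list)


def readahead(root, start, end):
    """Return the order in which blocks are read for the range [start, end)."""
    pending = [(root.addr, id(root), root)]  # min-heap ordered by disk address
    order = []
    while pending:
        addr, _, node = heapq.heappop(pending)
        order.append(addr)  # "read" the block
        # When the read finishes, enqueue only children overlapping the range.
        for child in node.children:
            if child.lo < end and child.hi > start:
                heapq.heappush(pending, (child.addr, id(child), child))
    return order


def demo_tree():
    """Build a small three-level tree with scattered disk addresses."""
    a = Node(50, 0, 40, [Node(70, 0, 20), Node(20, 20, 40)])
    b = Node(10, 40, 70)
    c = Node(30, 70, 100)
    return Node(100, 0, 100, [a, b, c])
```

Recording the upper bound `hi` for every node mirrors the "fix missed branches" note in the v2 changes: a branch is followed only when its key range overlaps the requested range.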
    • btrfs: state information for readahead · 90519d66
      Arne Jansen committed
      Add state information for readahead to btrfs_fs_info and btrfs_device
      
      Changes v2:
       - don't wait in radix_trees
       - add own set of workers for readahead
      Reviewed-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Arne Jansen <sensille@gmx.net>
    • btrfs: add READAHEAD extent buffer flag · ab0fff03
      Arne Jansen committed
      Add a READAHEAD extent buffer flag.
      Add a function to trigger a read with this flag set.
      
      Changes v2:
       - use extent buffer flags instead of extent state flags
      
      Changes v5:
       - adapt to changed read_extent_buffer_pages interface
       - don't return eb from reada_tree_block_flagged if it has CORRUPT flag set
      Signed-off-by: Arne Jansen <sensille@gmx.net>
    • btrfs: add an extra wait mode to read_extent_buffer_pages · bb82ab88
      Arne Jansen committed
      read_extent_buffer_pages currently has two modes, either trigger a read
      without waiting for anything, or wait for the I/O to finish. The former
      also bails when it's unable to lock the page. This patch now adds an
      additional parameter to allow it to block on page lock, but not wait
      for completion.
      
      Changes v5:
       - merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
         WAIT_PAGE_LOCK
      
      Change v6:
       - fix bug introduced in v5
      Signed-off-by: Arne Jansen <sensille@gmx.net>
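The three wait modes named in the v5 notes can be illustrated with a small Python model. It only approximates the behaviour described in the message: `Wait` and `read_pages` are hypothetical names, a `threading.Lock` stands in for the page lock, and a `threading.Event` stands in for I/O completion. In the kernel the page is unlocked by the I/O completion path; the sketch releases it right after "submitting" to stay short.

```python
import threading
from enum import Enum


class Wait(Enum):
    NONE = 0       # trigger the read, bail out if the page lock is busy
    COMPLETE = 1   # block on the page lock and wait for the I/O to finish
    PAGE_LOCK = 2  # block on the page lock, but do not wait for the I/O


def read_pages(page_lock, io_done, mode):
    """Model one read request and report how far it got."""
    if mode is Wait.NONE:
        if not page_lock.acquire(blocking=False):
            return "skipped"          # could not lock the page: give up
    else:
        page_lock.acquire()           # COMPLETE and PAGE_LOCK both block here
    try:
        pass                          # the read would be submitted here
    finally:
        page_lock.release()
    if mode is Wait.COMPLETE:
        io_done.wait()                # only this mode waits for completion
        return "completed"
    return "submitted"
```

Merging the two former wait parameters into one enumeration, as v5 did, makes the invalid combination (wait for completion but do not block on the lock) impossible to express.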
  2. 01 Oct, 2011 1 commit
    • Btrfs: force a page fault if we have a shorty copy on a page boundary · b6316429
      Josef Bacik committed
      A user reported a problem where ceph was getting into 100% cpu usage while doing
      some writing.  It turns out it's because we were doing a short write on a not
      uptodate page, which means we'd fall back to one page at a time and fault the
      page in.  The problem is our position is on the page boundary, so our fault in
      logic wasn't actually reading the page, so we'd just spin forever or until the
      page got read in by somebody else.  This will force a readpage if we end up
      doing a short copy.  Alexandre could reproduce this easily with ceph and reports
      it fixes his problem.  I also wrote a reproducer that no longer hangs my box
      with this patch.  Thanks,
      Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  3. 21 Sep, 2011 1 commit
  4. 18 Sep, 2011 6 commits
    • Btrfs: only clear the need lookup flag after the dentry is setup · a66e7cc6
      Josef Bacik committed
      We can race with readdir and the RCU path walking stuff.  This is because we
      clear the need lookup flag before actually instantiating the inode.  This will
      lead the RCU path walk stuff to find a dentry it thinks is valid without a
      d_inode attached.  So instead unhash the dentry when we first start the lookup,
      and then clear the flag after we've instantiated the dentry so we're guaranteed
      to either try the slow lookup, or have the d_inode set properly.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • BTRFS: Fix lseek return value for error · 48802c8a
      Jeff Liu committed
      The recent reworking of btrfs' lseek led to incorrect
      values being returned.  This adds checks for seeking
      beyond EOF in SEEK_HOLE and makes sure the error
      values come back correct.
      
      Andi Kleen also sent in similar patches.
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Reported-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
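The semantics this fix is aiming for can be stated as a small pure-Python model. It is a sketch of the documented `lseek(2)` behaviour for `SEEK_HOLE` (an offset at or beyond end of file fails with `ENXIO`, and a file without holes has an implicit hole at its end), not the btrfs code. The function name `seek_hole` and the extent-list representation are assumptions made for illustration.

```python
import errno


def seek_hole(file_size, data_extents, offset):
    """Return the first hole at or after `offset`.

    `data_extents` is a sorted list of (start, end) byte ranges that hold
    data; everything else inside the file is treated as a hole.
    """
    if offset >= file_size:
        # Seeking at or past EOF is an error, not a valid position.
        raise OSError(errno.ENXIO, "offset is beyond the end of the file")
    pos = offset
    for start, end in data_extents:
        if start <= pos < end:
            pos = end  # skip over the data extent containing pos
    # With no explicit hole left, the implicit hole at EOF is returned.
    return min(pos, file_size)
```

For a file of 8192 bytes whose first 4096 bytes hold data, a search from offset 0 lands on 4096, while a search from offset 8192 raises `OSError` with `errno.ENXIO`.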
    • Btrfs: don't change inode flag of the dest clone file · dde820fb
      Li Zefan committed
      The dst file will have the same inode flags as the src file after
      file clone, and I think it's unexpected.
      
      For example, the dst file will suddenly become immutable after
      sharing some data with the src file, if the src is immutable.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: don't make a file partly checksummed through file clone · 0e7b824c
      Li Zefan committed
      To reproduce the bug:
      
        # mount /dev/sda7 /mnt
        # dd if=/dev/zero of=/mnt/src bs=4K count=1
        # umount /mnt
      
        # mount -o nodatasum /dev/sda7 /mnt
        # dd if=/dev/zero of=/mnt/dst bs=4K count=1
        # clone_range -s 4K -l 4K /mnt/src /mnt/dst
      
        # echo 3 > /proc/sys/vm/drop_caches
        # cat /mnt/dst
        # dmesg
        ...
        btrfs no csum found for inode 258 start 0
        btrfs csum failed ino 258 off 0 csum 2566472073 private 0
      
      It's because part of the file is checksummed and the other part is not,
      and then btrfs will complain checksum is not found when we read the file.
      
      Disallow file clone if src and dst file have different checksum flag,
      so we ensure a file is completely checksummed or unchecksummed.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
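The rule introduced here (refuse the clone when source and destination disagree on checksumming) amounts to comparing a single flag bit. The Python sketch below illustrates that comparison only. `NODATASUM`, its bit value, `check_clone_allowed`, and the choice of `EINVAL` as the error are assumptions for illustration and are not taken from the commit message.

```python
import errno

NODATASUM = 1 << 0  # hypothetical bit meaning "this inode has no data checksums"


def check_clone_allowed(src_flags, dst_flags):
    """Raise if cloning would leave the destination partly checksummed."""
    if (src_flags & NODATASUM) != (dst_flags & NODATASUM):
        # Error code is an assumption; the message only says "disallow".
        raise OSError(errno.EINVAL, "src and dst differ in checksum flag")
    return True
```

In the reproducer above, `src` was written with checksums and `dst` under `-o nodatasum`, so the two flag values differ and the clone would be rejected.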
    • Btrfs: fix pages truncation in btrfs_ioctl_clone() · 71ef0786
      Li Zefan committed
      It's a bug in commit f81c9cdc
      (Btrfs: truncate pages from clone ioctl target range)
      
      We should pass the dest range to the truncate function, but not the
      src range.
      
      Also move the function before locking extent state.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: fix d_off in the first dirent · 3765fefa
      Hidetoshi Seto committed
      Since the d_off in the first dirent for "." (that originates from
      the 4th argument "offset" of filldir() for the 2nd dirent for "..")
      is wrongly assigned in btrfs_real_readdir(), telldir returns same
      offset for different locations.
      
       | # mkfs.btrfs /dev/sdb1
       | # mount /dev/sdb1 fs0
       | # cd fs0
       | # touch file0 file1
       | # ../test
       | telldir: 0
       | readdir: d_off = 2, d_name = "."
       | telldir: 2
       | readdir: d_off = 2, d_name = ".."
       | telldir: 2
       | readdir: d_off = 3, d_name = "file0"
       | telldir: 3
       | readdir: d_off = 2147483647, d_name = "file1"
       | telldir: 2147483647
      
      To fix this problem, pass filp->f_pos (which is loff_t) instead.
      
       | # ../test
       | telldir: 0
       | readdir: d_off = 1, d_name = "."
       | telldir: 1
       | readdir: d_off = 2, d_name = ".."
       | telldir: 2
       | readdir: d_off = 3, d_name = "file0"
       :
      
      At the moment the "offset" for "." is unused because there is no
      preceding dirent, however it is better to pass filp->f_pos to follow
      grammatical usage.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
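The two transcripts above can be reproduced with a small Python model of how the `offset` argument of filldir() becomes the `d_off` of the preceding entry. This is a sketch under assumptions, not the kernel code: `readdir_model` is a hypothetical name, the directory index values (2 for `file0`, 3 for `file1`) and the final position 2147483647 are read off the transcript, and the buggy branch assumes the old code passed the constants 1 and 2.

```python
INT_MAX = 2147483647  # final f_pos seen in the transcript


def readdir_model(entries, fixed):
    """Return (d_off, d_name) pairs as a getdents-style caller would see them.

    `entries` is a list of (dir_index, name) pairs for the real files.
    """
    out = []  # each item is [d_off, name]

    def filldir(name, offset):
        # The offset passed for an entry is stored as d_off of the previous one.
        if out:
            out[-1][0] = offset
        out.append([None, name])

    f_pos = 0
    filldir(".", f_pos if fixed else 1)
    f_pos = 1
    filldir("..", f_pos if fixed else 2)
    f_pos = 2
    for index, name in entries:
        filldir(name, index)
        f_pos = index + 1
    f_pos = INT_MAX               # end of directory reached
    out[-1][0] = f_pos            # the last entry gets the final file position
    return [(d_off, name) for d_off, name in out]
```

With `fixed=False` both "." and ".." report `d_off = 2`, which is why telldir returned the same offset for two different locations; with `fixed=True` they report 1 and 2.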
  5. 13 Sep, 2011 2 commits
  6. 11 Sep, 2011 11 commits
  7. 10 Sep, 2011 2 commits
    • Avoid dereferencing a 'request_queue' after last close. · 94007751
      NeilBrown committed
      On the last close of an 'md' device which has been stopped, the device
      is destroyed and in particular the request_queue is freed.  The free
      is done in a separate thread so it might happen a short time later.
      
      __blkdev_put calls bdev_inode_switch_bdi *after* ->release has been
      called.
      
      Since commit f758eeab
      bdev_inode_switch_bdi will dereference the 'old' bdi, which lives
      inside a request_queue, to get a spin lock.  This causes the last
      close on an md device to sometimes take a spin_lock which lives in
      freed memory - which results in an oops.
      
      So move the call to bdev_inode_switch_bdi before the call to
      ->release.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • vfs: automount should ignore LOOKUP_FOLLOW · 0ec26fd0
      Miklos Szeredi committed
      Prior to 2.6.38 automount would not trigger on either stat(2) or
      lstat(2) on the automount point.
      
      After 2.6.38, with the introduction of the ->d_automount()
      infrastructure, stat(2) and others would start triggering automount
      while lstat(2), etc. still would not.  This is a regression and a
      userspace ABI change.
      
      Problem originally reported here:
      
        http://thread.gmane.org/gmane.linux.kernel.autofs/6098
      
      It appears that there was an attempt at fixing various userspace tools
      to not trigger the automount.  But since the stat system call is
      rather common it is impossible to "fix" all userspace.
      
      This patch restores the original behavior, which is to not trigger on
      stat(2) and other symlink following syscalls.
      
      [ It's not really clear what the right behavior is.  Apparently Solaris
        does the "automount on stat, leave alone on lstat".  And some programs
        can get unhappy when "stat+open+fstat" ends up giving a different
        result from the fstat than from the initial stat.
      
        But the change in 2.6.38 resulted in problems for some people, so
        we're going back to old behavior.  Maybe we can re-visit this
        discussion at some future date  - Linus ]
      Reported-by: Leonardo Chiquitto <leonardo.lists@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Acked-by: Ian Kent <raven@themaw.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 06 Sep, 2011 5 commits
  9. 01 Sep, 2011 2 commits
    • xfs: fix ->write_inode return values · 58d84c4e
      Christoph Hellwig committed
      Currently we always redirty an inode that was attempted to be written out
      synchronously but has been cleaned by an AIL push internally, which is
      rather bogus.  Fix that by doing the i_update_core check early on and
      returning 0 for it.  Also include async calls for it, as doing any work for
      those is just as pointless.  While we're at it also fix the sign for the
      EIO return in case of a filesystem shutdown, and fix the completely
      nonsensical locking around xfs_log_inode.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      (cherry picked from commit 297db93bb74cf687510313eb235a7aec14d67e97)
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: fix xfs_mark_inode_dirty during umount · 866e4ed7
      Christoph Hellwig committed
      During umount we do not add a dirty inode to the lru and wait for it to
      become clean first, but force writeback of data and metadata with
      I_WILL_FREE set.  Currently there is no way for XFS to detect that the
      inode has been redirtied for metadata operations, as we skip the
      mark_inode_dirty call during teardown.  Fix this by setting i_update_core
      manually in that case, so that the inode gets flushed during inode reclaim.
      
      Alternatively we could enable calling mark_inode_dirty for inodes in
      I_WILL_FREE state, and let the VFS dirty tracking handle this.  I decided
      against this as we will get better I/O patterns from reclaim compared to
      the synchronous writeout in write_inode_now, and always marking the inode
      dirty in some way from xfs_mark_inode_dirty is a better safety net in
      either case.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      (cherry picked from commit da6742a5a4cc844a9982fdd936ddb537c0747856)
      Signed-off-by: Alex Elder <aelder@sgi.com>
  10. 31 Aug, 2011 1 commit
    • ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining · 8c0bec21
      Jiaying Zhang committed
      The i_mutex lock and flush_completed_IO() added by commit 2581fdc8
      in ext4_evict_inode() cause lockdep to complain about potential
      deadlock in several places.  In most/all of these LOCKDEP complaints
      it looks like it's a false positive, since many of the potential
      circular locking cases can't take place by the time the
      ext4_evict_inode() is called; but since at the very least it may mask
      real problems, we need to address this.
      
      This change removes the flush_completed_IO() and i_mutex lock in
      ext4_evict_inode().  Instead, we take a different approach to resolve
      the software lockup that commit 2581fdc8 intends to fix.  Rather
      than having the ext4-dio-unwritten thread wait to grab the i_mutex
      lock of an inode, we use mutex_trylock() instead, and simply requeue
      the work item if we fail to grab the inode's i_mutex lock.
      
      This should speed up work queue processing in general and also
      prevent the following deadlock scenario: During page fault,
      shrink_icache_memory is called that in turn evicts another inode B.
      Inode B has some pending io_end work so it calls ext4_ioend_wait()
      that waits for inode B's i_ioend_count to become zero.  However, inode
      B's ioend work was queued behind some of inode A's ioend work on the
      same cpu's ext4-dio-unwritten workqueue.  As the ext4-dio-unwritten
      thread on that cpu is processing inode A's ioend work, it tries to
      grab inode A's i_mutex lock.  Since the i_mutex lock of inode A has
      been held since before the page fault happened, we enter a deadlock.
      Signed-off-by: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
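The trylock-and-requeue approach described above can be shown with a short Python model. It is a sketch of the idea only: `drain` is a hypothetical name, `threading.Lock.acquire(blocking=False)` stands in for mutex_trylock(), a `deque` stands in for the per-CPU workqueue, and `max_passes` exists only so the demonstration terminates while a lock stays held.

```python
import threading
from collections import deque


def drain(queue, locks, max_passes=100):
    """Process queued inodes, requeueing any whose i_mutex stand-in is busy."""
    done = []
    passes = 0
    while queue and passes < max_passes:
        passes += 1
        inode = queue.popleft()
        lock = locks[inode]
        if not lock.acquire(blocking=False):  # trylock instead of blocking
            queue.append(inode)               # busy: put the item back
            continue
        try:
            done.append(inode)                # the ioend work would run here
        finally:
            lock.release()
    return done
```

If inode A's lock is held and the queue is A then B, the worker skips A, finishes B, and retries A later. With a blocking lock, B's work would sit behind A indefinitely, which is the deadlock the message describes.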
  11. 27 Aug, 2011 1 commit
  12. 26 Aug, 2011 1 commit
    • lockdep: Add helper function for dir vs file i_mutex annotation · e096d0c7
      Josh Boyer committed
      Purely in-memory filesystems do not use the inode hash as the dcache
      tells us if an entry already exists.  As a result, they do not call
      unlock_new_inode, and thus directory inodes do not get put into a
      different lockdep class for i_sem.
      
      We need the different lockdep classes, because the locking order for
      i_mutex is different for directory inodes and regular inodes.  Directory
      inodes can do "readdir()", which takes i_mutex *before* possibly taking
      mm->mmap_sem (due to a page fault while copying the directory entry to
      user space).
      
      In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
      before accessing i_mutex.
      
      The two cases can never happen for the same inode, so no real deadlock
      can occur, but without the different lockdep classes, lockdep cannot
      understand that.  As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
      can lead to false positives from lockdep like below:
      
          find/645 is trying to acquire lock:
           (&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac
      
          but task is already holding lock:
           (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>]
          vfs_readdir+0x5b/0xb4
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff814db822>] __mutex_lock_common+0x4c/0x361
                [<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45
                [<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110
                [<ffffffff81111557>] mmap_region+0x258/0x432
                [<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306
                [<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a
                [<ffffffff8100c858>] sys_mmap+0x22/0x24
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      
          -> #0 (&mm->mmap_sem){++++++}:
                [<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff81109541>] might_fault+0x89/0xac
                [<ffffffff81149cff>] filldir+0x6f/0xc7
                [<ffffffff811586ea>] dcache_readdir+0x67/0x205
                [<ffffffff81149f54>] vfs_readdir+0x7b/0xb4
                [<ffffffff8114a073>] sys_getdents+0x7e/0xd1
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      
      This patch moves the directory vs file lockdep annotation into a helper
      function that can be called by in-memory filesystems and has hugetlbfs
      call it.
      Signed-off-by: Josh Boyer <jwboyer@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
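Why a separate lock class removes the false positive can be seen with a toy dependency graph. The Python sketch below is not how lockdep is implemented; it only records "A is held while B is taken" edges between lock classes and looks for a cycle. The class names `i_mutex`, `i_mutex_dir`, and `mmap_sem` and the function `has_cycle` are labels chosen for illustration.

```python
def has_cycle(edges):
    """Return True if the lock-order graph given by (held, taken) pairs has a cycle."""
    graph = {}
    for held, taken in edges:
        graph.setdefault(held, set()).add(taken)

    visiting, finished = set(), set()

    def visit(node):
        if node in finished:
            return False
        if node in visiting:
            return True               # back edge: inconsistent lock order
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if visit(nxt):
                return True
        visiting.discard(node)
        finished.add(node)
        return False

    return any(visit(node) for node in list(graph))


# One shared class: readdir takes i_mutex then mmap_sem, mmap does the reverse.
shared_class = [("i_mutex", "mmap_sem"), ("mmap_sem", "i_mutex")]
# Directories in their own class: the two orders no longer meet.
split_class = [("i_mutex_dir", "mmap_sem"), ("mmap_sem", "i_mutex")]
```

`has_cycle(shared_class)` reports a cycle even though the two orders never apply to the same inode, while `has_cycle(split_class)` does not. That is the effect of putting directory inodes into a different class.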
  13. 25 Aug, 2011 1 commit
  14. 24 Aug, 2011 1 commit