1. 21 May, 2011 1 commit
    • M
      btrfs: implement delayed inode items operation · 16cdcec7
      Miao Xie committed
      Changelog V5 -> V6:
      - Fix oom when the memory load is high, by storing the delayed nodes into the
        root's radix tree, and letting btrfs inodes go.
      
      Changelog V4 -> V5:
      - Fix the race on adding the delayed node to the inode, which is spotted by
        Chris Mason.
      - Merge Chris Mason's incremental patch into this patch.
      - Fix deadlock between readdir() and memory fault, which is reported by
        Itaru Kitayama.
      
      Changelog V3 -> V4:
      - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
        inode in time.
      
      Changelog V2 -> V3:
      - Fix the race between the delayed worker and the task which does delayed items
        balance, which is reported by Tsutomu Itoh.
      - Modify the patch to address David Sterba's comments.
      - Fix the CPU recursion spinlock bug, reported by Chris Mason.
      
      Changelog V1 -> V2:
      - break up the global rb-tree, use a list to manage the delayed nodes,
        which is created for every directory and file, and used to manage the
        delayed directory name index items and the delayed inode item.
      - introduce a worker to deal with the delayed nodes.
      
      Compared with Ext3/4, the performance of file creation and deletion on btrfs
      is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
      such as the inode item, directory name item, directory name index and so on.
      
      If we can delay some b+ tree insertions or deletions, we can improve the
      performance, so we made this patch, which implements delayed directory name
      index insertion/deletion and delayed inode update.
      
      Implementation:
      - Introduce a delayed root object into the filesystem, which uses two lists to
        manage the delayed nodes that are created for every file/directory.
        One list holds all the delayed nodes that have delayed items. The other
        holds the delayed nodes that are waiting to be dealt with by the work
        thread.
      - Every delayed node has two rb-trees: one manages the directory name index
        items that are going to be inserted into the b+ tree, and the other
        manages the directory name index items that are going to be deleted from
        the b+ tree.
      - Introduce a worker to deal with the delayed operations: the insertion and
        deletion of delayed directory name index items and the delayed inode
        update.
        When the number of delayed items exceeds the lower limit, we create works
        for some delayed nodes, insert them into the work queue of the worker, and
        then return.
        When the number of delayed items exceeds the upper bound, we create works
        for all the delayed nodes that haven't been dealt with, insert them into
        the work queue of the worker, and then wait until the number of untreated
        items drops below some threshold value.
      - When we want to insert a directory name index into the b+ tree, we just add
        the information into the delayed inserting rb-tree.
        Then we check the number of delayed items and do the delayed items
        balance. (The balance policy is described above.)
      - When we want to delete a directory name index from the b+ tree, we first
        search for it in the inserting rb-tree. If we find it, we just drop it. If
        not, we add its key into the delayed deleting rb-tree.
        As with the delayed inserting rb-tree, we also check the number of delayed
        items and do the delayed items balance.
        (The same as for insertion.)
      - When we want to update the metadata of some inode, we cache the data of the
        inode in the delayed node. The worker will flush it into the b+ tree after
        dealing with the delayed insertions and deletions.
      - We move the delayed node to the tail of the list after we access it. This
        way, we can cache more delayed items and merge more inode updates.
      - When we commit a transaction, we deal with all the delayed nodes.
      - The delayed node is freed when we free the btrfs inode.
      - Before we log the inode items, we commit all the directory name index items
        and the delayed inode update.
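
The insert/delete rules and the balance policy in the list above can be sketched in userspace C. This is only an illustration of the described behaviour, not the kernel code: the small arrays stand in for the two rb-trees, and the names `delayed_node`, `delayed_insert`, `delayed_delete`, `delayed_balance`, `LOWER_LIMIT` and `UPPER_BOUND`, the threshold values, and the handling of counts exactly at a threshold are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITEMS 16

/* Hypothetical thresholds; names and values are assumptions. */
enum { LOWER_LIMIT = 100, UPPER_BOUND = 400 };

enum balance_action {
    BALANCE_NONE,       /* few items pending: keep caching them */
    BALANCE_ASYNC,      /* queue work for some nodes, then return */
    BALANCE_ASYNC_WAIT  /* queue work for all nodes, then wait */
};

/* Arrays stand in for the inserting and deleting rb-trees. */
struct delayed_node {
    uint64_t ins[MAX_ITEMS];
    size_t nr_ins;
    uint64_t del[MAX_ITEMS];
    size_t nr_del;
};

/* Record a directory name index that should be inserted later. */
bool delayed_insert(struct delayed_node *n, uint64_t index)
{
    if (n->nr_ins == MAX_ITEMS)
        return false;
    n->ins[n->nr_ins++] = index;
    return true;
}

/*
 * Delete a directory name index: if it is still pending insertion,
 * drop it so that no b+ tree operation is needed at all; otherwise
 * record it as a pending deletion.
 */
bool delayed_delete(struct delayed_node *n, uint64_t index)
{
    for (size_t i = 0; i < n->nr_ins; i++) {
        if (n->ins[i] == index) {
            n->ins[i] = n->ins[--n->nr_ins];
            return true;
        }
    }
    if (n->nr_del == MAX_ITEMS)
        return false;
    n->del[n->nr_del++] = index;
    return true;
}

/* Decide what the caller does after adding a delayed item. */
enum balance_action delayed_balance(size_t pending_items)
{
    if (pending_items < LOWER_LIMIT)
        return BALANCE_NONE;
    if (pending_items < UPPER_BOUND)
        return BALANCE_ASYNC;
    return BALANCE_ASYNC_WAIT;
}
```

A create immediately followed by an unlink of the same name therefore cancels out in memory: `delayed_insert(&n, 7)` then `delayed_delete(&n, 7)` leaves both sets empty.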
      
      I did a quick test with the benchmark tool[1] and found that we can improve
      the performance of file creation by ~15%, and of file deletion by ~20%.
      
      Before applying this patch:
      Create files:
              Total files: 50000
              Total time: 1.096108
              Average time: 0.000022
      Delete files:
              Total files: 50000
              Total time: 1.510403
              Average time: 0.000030
      
      After applying this patch:
      Create files:
              Total files: 50000
              Total time: 0.932899
              Average time: 0.000019
      Delete files:
              Total files: 50000
              Total time: 1.215732
              Average time: 0.000024
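
The quoted percentages can be checked from the total times listed here. The helper below is a small sketch (the name `improvement_pct` is made up for this example) that computes the relative reduction in total time.

```c
/* Relative reduction in total time, as a percentage of the old time. */
double improvement_pct(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

With the figures above, `improvement_pct(1.096108, 0.932899)` comes to roughly 14.9 for file creation and `improvement_pct(1.510403, 1.215732)` to roughly 19.5 for file deletion, which matches the rounded ~15% and ~20%.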
      
      [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
      
      Many thanks for Kitayama-san's help!
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: David Sterba <dave@jikos.cz>
      Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      16cdcec7
  2. 15 May, 2011 3 commits
  3. 12 April, 2011 1 commit
  4. 05 April, 2011 2 commits
  5. 28 March, 2011 4 commits
  6. 24 March, 2011 1 commit
  7. 18 March, 2011 1 commit
    • J
      Btrfs: handle errors in btrfs_orphan_cleanup · 66b4ffd1
      Josef Bacik committed
      If we cannot truncate an inode for some reason we will never delete the orphan
      item associated with that inode, which means that we will loop forever in
      btrfs_orphan_cleanup.  Instead of doing this, just return an error so we fail to
      mount.  It sucks, but hey it's better than hanging.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      66b4ffd1
  8. 17 February, 2011 1 commit
  9. 15 February, 2011 1 commit
    • D
      btrfs: prevent heap corruption in btrfs_ioctl_space_info() · 51788b1b
      Dan Rosenberg committed
      Commit bf5fc093 refactored
      btrfs_ioctl_space_info() and introduced several security issues.
      
      space_args.space_slots is an unsigned 64-bit type controlled by a
      possibly unprivileged caller.  The comparison as a signed int type
      allows providing values that are treated as negative and cause the
      subsequent allocation size calculation to wrap, or be truncated to 0.
      By providing a size that's truncated to 0, kmalloc() will return
      ZERO_SIZE_PTR.  It's also possible to provide a value smaller than the
      slot count.  The subsequent loop ignores the allocation size when
      copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR.
      
      The fix changes the slot count type and comparison typecast to u64,
      which prevents truncation or signedness errors, and also ensures that we
      don't copy more data than we've allocated in the subsequent loop.  Note
      that zero-size allocations are no longer possible since there is already
      an explicit check for space_args.space_slots being 0 and truncation of
      this value is no longer an issue.
      Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      51788b1b
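
The truncation and signedness problem described in this commit message can be reproduced in isolation. The two helpers below are hypothetical (they are not the kernel functions) and only contrast clamping a caller-supplied 64-bit slot count as a signed `int` with clamping it as an unsigned 64-bit value. Converting an out-of-range value to `int` is implementation-defined in C; the example assumes the usual GCC/Clang behaviour of keeping the low 32 bits.

```c
#include <stdint.h>

/* Buggy pattern: the 64-bit request is compared as a signed int. */
int clamp_slots_signed(uint64_t requested, int available)
{
    int req = (int)requested; /* may turn negative or become 0 */

    return req < available ? req : available;
}

/* Fixed pattern: the comparison is done on unsigned 64-bit values. */
uint64_t clamp_slots_u64(uint64_t requested, uint64_t available)
{
    return requested < available ? requested : available;
}
```

Under that assumption, a request of `0xFFFFFFFF` clamps to -1 in the signed version and a request of `0x100000000` clamps to 0, so an allocation size derived from the result wraps or becomes zero. The unsigned version returns the available count (8 in a call such as `clamp_slots_u64(0x100000000ULL, 8)`) in both cases.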
  10. 01 February, 2011 1 commit
  11. 29 January, 2011 2 commits
  12. 27 January, 2011 1 commit
    • L
      Btrfs: Fix file clone when source offset is not 0 · 4d728ec7
      Li Zefan committed
      Suppose:
      - the source extent is: [0, 100]
      - the src offset is 10
      - the clone length is 90
      - the dest offset is 0
      
      This statement:
      
      	new_key.offset = key.offset + destoff - off
      
      will produce the following extent key for the dest file:
      
      	[ino, BTRFS_EXTENT_DATA_KEY, -10]
      
      which is obviously wrong.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      4d728ec7
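
The arithmetic in this commit message can be checked with a small userspace sketch. The function names are hypothetical, and the guarded variant is one plausible way to avoid the underflow when the clone range starts inside the source extent. It is an assumption for illustration, not a quotation of the patch.

```c
#include <stdint.h>

/* The statement from the commit message, with unsigned 64-bit offsets. */
uint64_t dest_key_offset_naive(uint64_t key_offset, uint64_t off,
                               uint64_t destoff)
{
    return key_offset + destoff - off; /* wraps when off > key_offset */
}

/*
 * Guarded variant (assumed): if the clone range starts inside the
 * source extent, the first destination extent begins at destoff.
 */
uint64_t dest_key_offset_guarded(uint64_t key_offset, uint64_t off,
                                 uint64_t destoff)
{
    if (off <= key_offset)
        return key_offset + destoff - off;
    return destoff;
}
```

For the example values (extent at 0, src offset 10, dest offset 0), the naive form yields `(uint64_t)-10`, while the guarded form yields 0.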
  13. 23 December, 2010 3 commits
    • L
      Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls · 0caa102d
      Li Zefan committed
      This allows us to set a snapshot or a subvolume readonly or writable
      on the fly.
      
      Usage:
      
      Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_args_v2->flags, and then
      call ioctl(BTRFS_IOC_SUBVOL_SETFLAGS);
      
      Changelog for v3:
      
      - Change to pass __u64 as ioctl parameter.
      
      Changelog for v2:
      
      - Add _GETFLAGS ioctl.
      - Check if the passed fd is the root of a subvolume.
      - Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      0caa102d
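
As a usage sketch: per the v3 changelog in this commit message, the ioctls take a `__u64` flags word, so a userspace caller might toggle the read-only bit as below. The helper names `with_readonly` and `subvol_set_readonly` are made up for this example; the ioctl numbers and `BTRFS_SUBVOL_RDONLY` come from the `<linux/btrfs.h>` UAPI header, which is assumed to be installed.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Return flags with the read-only bit set or cleared. */
__u64 with_readonly(__u64 flags, int readonly)
{
    if (readonly)
        return flags | BTRFS_SUBVOL_RDONLY;
    return flags & ~BTRFS_SUBVOL_RDONLY;
}

/*
 * fd must be an open descriptor on the root of a subvolume.
 * Returns 0 on success or a negative errno value.
 */
int subvol_set_readonly(int fd, int readonly)
{
    __u64 flags = 0;

    if (ioctl(fd, BTRFS_IOC_SUBVOL_GETFLAGS, &flags) < 0)
        return -errno;
    flags = with_readonly(flags, readonly);
    if (ioctl(fd, BTRFS_IOC_SUBVOL_SETFLAGS, &flags) < 0)
        return -errno;
    return 0;
}
```

Reading the current flags first keeps any other flag bits intact when only the read-only bit should change.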
    • L
      Btrfs: Add readonly snapshots support · b83cc969
      Li Zefan committed
      Usage:
      
      Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_args_v2->flags, and call
      ioctl(BTRFS_IOC_SNAP_CREATE_V2).
      
      Implementation:
      
      - Set readonly bit of btrfs_root_item->flags.
      - Add readonly checks in btrfs_permission (inode_permission),
        btrfs_setattr, btrfs_set/remove_xattr and some ioctls.
      
      Changelog for v3:
      
      - Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
      - Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      b83cc969
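
A userspace call following the usage note in this commit message might look like the sketch below. The function name `create_readonly_snapshot` is hypothetical; the structure, flag and ioctl number are taken from `<linux/btrfs.h>`, which is assumed to be installed. The ioctl is issued on the directory that will contain the snapshot, and `args.fd` names the source subvolume.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/*
 * dst_dir_fd: open directory in which the snapshot is created.
 * src_fd:     open descriptor on the source subvolume.
 * Returns 0 on success or a negative errno value.
 */
int create_readonly_snapshot(int dst_dir_fd, int src_fd, const char *name)
{
    struct btrfs_ioctl_vol_args_v2 args;

    if (strlen(name) > BTRFS_SUBVOL_NAME_MAX)
        return -ENAMETOOLONG;

    memset(&args, 0, sizeof(args));
    args.fd = src_fd;
    args.flags = BTRFS_SUBVOL_RDONLY; /* request a read-only snapshot */
    strcpy(args.name, name);          /* length checked above */

    if (ioctl(dst_dir_fd, BTRFS_IOC_SNAP_CREATE_V2, &args) < 0)
        return -errno;
    return 0;
}
```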
    • L
      Btrfs: Refactor btrfs_ioctl_snap_create() · fa0d2b9b
      Li Zefan committed
      Split it into two functions for two different ioctls, since they
      share no common code.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      fa0d2b9b
  14. 22 December, 2010 2 commits
  15. 11 December, 2010 2 commits
  16. 22 November, 2010 3 commits
  17. 30 October, 2010 9 commits
    • S
      Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed · 4260f7c7
      Sage Weil committed
      Add a mount option user_subvol_rm_allowed that allows users to delete a
      (potentially non-empty!) subvol when they would otherwise be allowed to do
      an rmdir(2).  We duplicate the may_delete() checks from the core VFS code
      to implement identical security checks (minus the directory size check).
      We additionally require that the user has write+exec permission on the
      subvol root inode.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4260f7c7
    • S
      Btrfs: make SNAP_DESTROY async · 531cb13f
      Sage Weil committed
      There is no reason to force an immediate commit when deleting a snapshot.
      Users have some expectation that space from a deleted snapshot be freed
      immediately, but even if we do commit the reclaim is a background process.
      
      If users _do_ want the deletion to be durable, they can call 'sync'.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      531cb13f
    • S
      Btrfs: add SNAP_CREATE_ASYNC ioctl · 72fd032e
      Sage Weil committed
      Create a snap without waiting for it to commit to disk.  The ioctl is
      ordered such that subsequent operations will not be contained by the
      created snapshot, and the commit is initiated, but the ioctl does not
      wait for the snapshot to commit to disk.
      
      We return the specific transid to userspace so that an application can wait
      for this specific snapshot creation to commit via the WAIT_SYNC ioctl.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      72fd032e
    • S
      Btrfs: add START_SYNC, WAIT_SYNC ioctls · 46204592
      Sage Weil committed
      START_SYNC will start a sync/commit, but not wait for it to
      complete.  Any modification started after the ioctl returns is
      guaranteed not to be included in the commit.  If a non-NULL
      pointer is passed, the transaction id will be returned to
      userspace.
      
      WAIT_SYNC will wait for any in-progress commit to complete.  If a
      transaction id is specified, the ioctl will block and then
      return (success) when the specified transaction has committed.
      If it has already committed when we call the ioctl, it returns
      immediately.  If the specified transaction doesn't exist, it
      returns EINVAL.
      
      If no transaction id is specified, WAIT_SYNC will wait for the
      currently committing transaction to finish its commit to disk.
      If there is no currently committing transaction, it returns
      success.
      
      These ioctls are useful for applications which want to impose an
      ordering on when fs modifications reach disk, but do not want to
      wait for the full (slow) commit process to do so.
      
      Picky callers can take the transid returned by START_SYNC and
      feed it to WAIT_SYNC, and be certain to wait only as long as
      necessary for the transaction _they_ started to reach disk.
      
      Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
      and provided they didn't wait too long between the calls, they
      will get the same result.  However, if a second commit starts
      before they call WAIT_SYNC, they may end up waiting longer for
      it to commit as well.  Even so, a START_SYNC+WAIT_SYNC still
      guarantees that any operation completed before the START_SYNC
      reaches disk.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      46204592
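
The "picky caller" pattern described in this commit message can be sketched in userspace as follows. The function name `start_and_wait_sync` is made up for the example; `BTRFS_IOC_START_SYNC` and `BTRFS_IOC_WAIT_SYNC` come from `<linux/btrfs.h>`, which is assumed to be installed, and `fd` is assumed to be any open descriptor on the btrfs filesystem.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/*
 * Start a commit, then wait for exactly that transaction.
 * Returns 0 on success or a negative errno value.
 */
int start_and_wait_sync(int fd)
{
    __u64 transid = 0;

    /* Kick off the commit; the kernel fills in the transaction id. */
    if (ioctl(fd, BTRFS_IOC_START_SYNC, &transid) < 0)
        return -errno;

    /* Other work can be done here while the commit proceeds. */

    /* Block until the transaction started above has committed. */
    if (ioctl(fd, BTRFS_IOC_WAIT_SYNC, &transid) < 0)
        return -errno;
    return 0;
}
```

Passing the returned transaction id to the wait call is what limits the wait to the commit this caller started, rather than to a later one.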
    • S
      Btrfs: fix lockdep warning on clone ioctl · fccdae43
      Sage Weil committed
      I'm no lockdep expert, but this appears to make the lockdep warning go
      away for the i_mutex locking in the clone ioctl.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      fccdae43
    • S
      Btrfs: fix clone ioctl where range is adjacent to extent · 050006a7
      Sage Weil committed
      We had an edge case issue where the requested range was just
      following an existing extent. Instead of skipping to the next
      extent, we used the previous one, which led to having zero
      sized extents.
      Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      050006a7
    • S
      Btrfs: fix delalloc checks in clone ioctl · 9a019196
      Sage Weil committed
      The lookup_first_ordered_extent() was done on the wrong inode, and the
      ->delalloc_bytes test was wrong, as the following
      btrfs_wait_ordered_range() would only invoke a range write and wouldn't
      write the entire file data range. Also, a bad parameter was passed to
      btrfs_wait_ordered_range().
      Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9a019196
    • A
      Btrfs: cleanup warnings from gcc 4.6 (nonbugs) · 559af821
      Andi Kleen committed
      These are all the cases where a variable is set, but not read which are
      not bugs as far as I can see, but simply leftovers.
      
      Still needs more review.
      
      Found by gcc 4.6's new warnings
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      559af821
    • J
      Btrfs: use memdup_user helpers · 2354d08f
      Julia Lawall committed
      Use memdup_user when user data is immediately copied into the
      allocated region.
      
      The semantic patch that makes this change is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      expression from,to,size,flag;
      position p;
      identifier l1,l2;
      @@
      
      -  to = \(kmalloc@p\|kzalloc@p\)(size,flag);
      +  to = memdup_user(from,size);
         if (
      -      to==NULL
      +      IS_ERR(to)
                       || ...) {
         <+... when != goto l1;
      -  -ENOMEM
      +  PTR_ERR(to)
         ...+>
         }
      -  if (copy_from_user(to, from, size) != 0) {
      -    <+... when != goto l2;
      -    -EFAULT
      -    ...+>
      -  }
      // </smpl>
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      2354d08f
  18. 23 October, 2010 1 commit
    • J
      Btrfs: fix the df ioctl to report raid types · bf5fc093
      Josef Bacik committed
      The new ENOSPC stuff broke the df ioctl since we no longer create separate space
      info's for each RAID type.  So instead, loop through each space info's raid
      lists so we can get the right RAID information which will allow the df ioctl to
      tell us RAID types again.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      bf5fc093
  19. 20 July, 2010 1 commit
    • D
      Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE · 2ebc3464
      Dan Rosenberg committed
      1.  The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
      whether the donor file is append-only before writing to it.
      
      2.  The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
      overflow that allows a user to specify an out-of-bounds range to copy
      from the source file (if off + len wraps around).  I haven't been able
      to successfully exploit this, but I'd imagine that a clever attacker
      could use this to read things he shouldn't.  Even if it's not
      exploitable, it couldn't hurt to be safe.
      Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
      cc: stable@kernel.org
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      2ebc3464
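
The wrap-around in point 2 of this commit message (`off + len` overflowing) can be guarded against with a check like the one below. This is a standalone sketch with a hypothetical function name, not the kernel code: unsigned addition wraps in C, so a sum smaller than one of its operands signals an overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if [off, off + len) lies within a source file of src_size bytes. */
bool clone_range_valid(uint64_t off, uint64_t len, uint64_t src_size)
{
    if (off + len < off) /* the sum wrapped around */
        return false;
    return off + len <= src_size;
}
```

Without the first test, a request such as `off = UINT64_MAX - 5` with `len = 10` wraps to a small sum and would pass a plain `off + len <= src_size` comparison.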