1. 13 1月, 2011 1 次提交
  2. 07 1月, 2011 5 次提交
    • N
      btrfs: provide simple rcu-walk ACL implementation · 258a5aa8
      Nick Piggin 提交于
      This simple implementation just checks for no ACLs on the inode, and
      if so, then the rcu-walk may proceed, otherwise fail it.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      258a5aa8
    • N
      fs: provide rcu-walk aware permission i_ops · b74c79e9
      Nick Piggin 提交于
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b74c79e9
    • N
      fs: dcache reduce branches in lookup path · fb045adb
      Nick Piggin 提交于
      Reduce some branches and memory accesses in dcache lookup by adding dentry
      flags to indicate common d_ops are set, rather than having to check them.
      This saves a pointer memory access (dentry->d_op) in common path lookup
      situations, and saves another pointer load and branch in cases where we
      have d_op but not the particular operation.
      
      Patched with:
      
      git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fb045adb
    • N
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin 提交于
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fa0d7e3d
    • N
      fs: change d_delete semantics · fe15ce44
      Nick Piggin 提交于
      Change d_delete from a dentry deletion notification to a dentry caching
      advise, more like ->drop_inode. Require it to be constant and idempotent,
      and not take d_lock. This is how all existing filesystems use the callback
      anyway.
      
      This makes fine grained dentry locking of dput and dentry lru scanning
      much simpler.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fe15ce44
  3. 21 12月, 2010 1 次提交
  4. 14 12月, 2010 3 次提交
    • C
      Btrfs: prevent RAID level downgrades when space is low · 83a50de9
      Chris Mason 提交于
      The extent allocator has code that allows us to fill
      allocations from any available block group, even if it doesn't
      match the raid level we've requested.
      
      This was put in because adding a new drive to a filesystem
      made with the default mkfs options actually upgrades the metadata from
      single spindle dup to full RAID1.
      
      But, the code also allows us to allocate from a raid0 chunk when we
      really want a raid1 or raid10 chunk.  This can cause big trouble because
      mkfs creates a small (4MB) raid0 chunk for data and metadata which then
      goes unused for raid1/raid10 installs.
      
      The allocator will happily wander in and allocate from that chunk when
      things get tight, which is not correct.
      
      The fix here is to make sure that we provide duplication when the
      caller has asked for it.  It does all the dups to be any raid level,
      which preserves the dup->raid1 upgrade abilities.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      83a50de9
    • C
      Btrfs: account for missing devices in RAID allocation profiles · cd02dca5
      Chris Mason 提交于
      When we mount in RAID degraded mode without adding a new device to
      replace the failed one, we can end up using the wrong RAID flags for
      allocations.
      
      This results in strange combinations of block groups (raid1 in a raid10
      filesystem) and corruptions when we try to allocate blocks from single
      spindle chunks on drives that are actually missing.
      
      The first device has two small 4MB chunks in it that mkfs creates and
      these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
      the allocator will fall back to these because the mask of desired raid groups
      isn't correct.
      
      The fix here is to count the missing devices as we build up the list
      of devices in the system.  This count is used when picking the
      raid level to make sure we continue using the same levels that were
      in place before we lost a drive.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cd02dca5
    • C
      Btrfs: EIO when we fail to read tree roots · 68433b73
      Chris Mason 提交于
      If we just get a plain IO error when we read tree roots, the code
      wasn't properly sending that error up the chain.  This allowed mounts to
      continue when they should failed, and allowed operations
      on partially setup root structs.  The end result was usually oopsen
      on spinlocks that hadn't been spun up correctly.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      68433b73
  5. 11 12月, 2010 7 次提交
  6. 10 12月, 2010 4 次提交
  7. 29 11月, 2010 2 次提交
  8. 28 11月, 2010 6 次提交
    • J
      Btrfs: setup blank root and fs_info for mount time · 450ba0ea
      Josef Bacik 提交于
      There is a problem with how we use sget, it searches through the list of supers
      attached to the fs_type looking for a super with the same fs_devices as what
      we're trying to mount.  This depends on sb->s_fs_info being filled, but we don't
      fill that in until we get to btrfs_fill_super, so we could hit supers on the
      fs_type super list that have a null s_fs_info.  In order to fix that we need to
      go ahead and setup a blank root with a blank fs_info to hold fs_devices, that
      way our test will work out right and then we can set s_fs_info in
      btrfs_set_super, and then open_ctree will simply use our pre-allocated root and
      fs_info when setting everything up.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      450ba0ea
    • J
      Btrfs: fix fiemap · 975f84fe
      Josef Bacik 提交于
      There are two big problems currently with FIEMAP
      
      1) We return extents for holes.  This isn't supposed to happen, we just don't
      return extents for holes and then userspace interprets the lack of an extent as
      a hole.
      
      2) We sometimes don't set FIEMAP_EXTENT_LAST properly.  This is because we wait
      to see a EXTENT_FLAG_VACANCY flag on the em, but this won't happen if say we ask
      fiemap to map up to the last extent in a file, and there is nothing but holes up
      to the i_size.  To fix this we need to lookup the last extent in this file and
      save the logical offset, so if we happen to try and map that extent we can be
      sure to set FIEMAP_EXTENT_LAST.
      
      With this patch we now pass xfstest 225, which we never have before.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      975f84fe
    • I
      Btrfs - fix race between btrfs_get_sb() and umount · 619c8c76
      Ian Kent 提交于
      When mounting a btrfs file system btrfs_test_super() may attempt to
      use sb->s_fs_info, the btrfs root, of a super block that is going away
      and that has had the btrfs root set to NULL in its ->put_super(). But
      if the super block is going away it cannot be an existing super block
      so we can return false in this case.
      Signed-off-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      619c8c76
    • J
      Btrfs: update inode ctime when using links · bc1cbf1f
      Josef Bacik 提交于
      Currently we fail xfstest 236 because we're not updating the inode ctime on
      link.  This is a simple fix, and makes it so we pass 236 now.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bc1cbf1f
    • J
      Btrfs: make sure new inode size is ok in fallocate · 0ed42a63
      Josef Bacik 提交于
      We have been failing xfstest 228 forever, because we don't check to make sure
      the new inode size is acceptable as far as RLIMIT is concerned.  Just check to
      make sure it's ok to create a inode with this new size and error out if not.
      With this patch we now pass 228.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0ed42a63
    • J
      Btrfs: fix typo in fallocate to make it honor actual size · 55a61d1d
      Josef Bacik 提交于
      There is a typo in __btrfs_prealloc_file_range() where we set the i_size to
      actual_len/cur_offset, and then just set it to cur_offset again, and do the same
      with btrfs_ordered_update_i_size().  This fixes it back to keeping i_size in a
      local variable and then updating i_size properly.  Tested this with
      
      xfs_io -F -f -c "falloc 0 1" -c "pwrite 0 1" foo
      
      stat'ing foo gives us a size of 1 instead of 4096 like it was.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      55a61d1d
  9. 22 11月, 2010 11 次提交