1. 01 Apr, 2009 2 commits
    • fs: fix page_mkwrite error cases in core code and btrfs · 56a76f82
      Nick Piggin committed
      page_mkwrite is called with neither the page lock nor the ptl held.  This
      means a page can be concurrently truncated or invalidated out from
      underneath it.  Callers are supposed to prevent truncate races themselves;
      however, previously the only thing they could do if they hit one was to
      raise a SIGBUS.  A SIGBUS is wrong when the page has been
      invalidated or truncated within i_size (e.g. hole punched).  Callers may
      also have to perform memory allocations in this path, where again, SIGBUS
      would be wrong.
      
      The previous patch ("mm: page_mkwrite change prototype to match fault")
      made it possible to properly specify errors.  Convert the generic buffer.c
      code and btrfs to return sane error values (in the case of page removed
      from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
      without doing anything, and the fault will be retried properly).
      
      This fixes core code, and converts btrfs as a template/example.  All other
      filesystems defining their own page_mkwrite should be fixed in a similar
      manner.
      Acked-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
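
The error mapping described in this commit can be sketched in user space as follows. This is a minimal illustration, not the kernel code: the enum values and the function name `mkwrite_result` are hypothetical stand-ins for the kernel's VM_FAULT_* flags, and mapping -ENOMEM to an out-of-memory result is an assumption drawn from the remark about memory allocations.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's VM_FAULT_* flags (values are illustrative). */
enum fault_result {
    FAULT_OK     = 0,
    FAULT_OOM    = 1,
    FAULT_SIGBUS = 2,
    FAULT_NOPAGE = 4
};

/*
 * Decide what a page_mkwrite-style handler should report.
 * page_in_cache: 0 if the page was truncated or invalidated while unlocked.
 * err: 0 on success, or a negative errno from preparing the page for write.
 */
static enum fault_result mkwrite_result(int page_in_cache, int err)
{
    assert(err <= 0);              /* errors are negative errno values */

    if (!page_in_cache)
        return FAULT_NOPAGE;       /* do nothing; the fault is retried */
    if (err == 0)
        return FAULT_OK;
    if (err == -ENOMEM)
        return FAULT_OOM;          /* assumption: allocation failure is not a SIGBUS */
    return FAULT_SIGBUS;           /* remaining errors still raise SIGBUS */
}
```

The point of the sketch is only that a page that left the pagecache and an allocation failure no longer collapse into SIGBUS.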
    • mm: page_mkwrite change prototype to match fault · c2ec175c
      Nick Piggin committed
      Change the page_mkwrite prototype to take a struct vm_fault, and return
      VM_FAULT_xxx flags.  There should be no functional change.
      
      This makes it possible to return much more detailed error information to
      the VM (and can also provide more information, e.g. virtual_address, to the
      driver, which might be important in some special cases).

      This is required for a subsequent fix, and will also make it easier to
      merge page_mkwrite() with fault() in the future.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Cc: Felix Blyakher <felixb@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 21 Feb, 2009 1 commit
    • Btrfs: add better -ENOSPC handling · 6a63209f
      Josef Bacik committed
      This is a step in the direction of better -ENOSPC handling.  Instead of
      checking the global bytes counter we check the space_info bytes counters to
      make sure we have enough space.
      
      If we don't, we go ahead and try to allocate a new chunk, and if that fails
      we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
      bytes_delalloc and bytes_may_use.

      bytes_delalloc accounts for extents we've actually set up for delalloc and
      that will be allocated at some point down the line.

      bytes_may_use keeps track of how many bytes we may use for delalloc at
      some point.  When we actually set the extent_bit for the delalloc bytes we
      subtract the reserved bytes from the bytes_may_use counter.  This ensures
      that space can actually be allocated for any delalloc bytes.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
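
The two counters can be modelled roughly as below. This is a simplified sketch under stated assumptions: only `bytes_delalloc` and `bytes_may_use` come from the commit message, while `total_bytes`, `bytes_used` and the function names are hypothetical, and the step that tries to allocate a new chunk before failing is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simplified model of per-space accounting (not the kernel struct). */
struct space_info {
    uint64_t total_bytes;     /* hypothetical: capacity of this space */
    uint64_t bytes_used;      /* hypothetical: bytes already allocated */
    uint64_t bytes_may_use;   /* reserved; may become delalloc later */
    uint64_t bytes_delalloc;  /* delalloc extents that are actually set up */
};

static uint64_t bytes_committed(const struct space_info *s)
{
    return s->bytes_used + s->bytes_may_use + s->bytes_delalloc;
}

/* Reserve space ahead of a write. Returns 0 on success or -ENOSPC. */
static int reserve_bytes(struct space_info *s, uint64_t bytes)
{
    assert(bytes_committed(s) <= s->total_bytes);
    if (bytes > s->total_bytes - bytes_committed(s))
        return -ENOSPC;       /* the real code would first try a new chunk */
    s->bytes_may_use += bytes;
    return 0;
}

/* Move a reservation to delalloc once the extent bit is set. */
static void mark_delalloc(struct space_info *s, uint64_t bytes)
{
    assert(bytes <= s->bytes_may_use);
    s->bytes_may_use -= bytes;
    s->bytes_delalloc += bytes;
}
```

Because a reservation is counted before the delalloc extent exists, a later `mark_delalloc` only moves bytes between counters and cannot itself run out of space.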
  3. 13 Feb, 2009 1 commit
    • Btrfs: remove btrfs_init_path · e00f7308
      Jeff Mahoney committed
      btrfs_init_path was initially used when the path objects were on the
      stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
      isn't required.
      
      This patch removes it, and just uses kmem_cache_zalloc to zero out the object.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 12 Feb, 2009 1 commit
  5. 07 Feb, 2009 1 commit
  6. 04 Feb, 2009 6 commits
    • Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode · 06d9a8d7
      Chris Mason committed
      btrfs_truncate_inode_items is set up to stop doing btree searches when
      it has finished removing the items for the inode.  It used to detect the
      end of the inode by looking for an objectid that didn't match the
      one we were searching for.
      
      But, this would result in an extra search through the btree, which
      adds extra balancing and cow costs to the operation.
      
      This commit adds a check to see if we found the inode item, which means
      we can stop searching early.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Don't try to compress pages past i_size · f03d9301
      Chris Mason committed
      The compression code had some checks to make sure we were only
      compressing bytes inside of i_size, but it wasn't catching every
      case.  To make things worse, some incorrect math about the number
      of bytes remaining would make it try to compress more pages than the
      file really had.
      
      The fix used here is to fall back to the non-compression code in this
      case, which does all the proper cleanup of delalloc and other accounting.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Handle SGID bit when creating inodes · 8c087b51
      Chris Ball committed
      Before this patch, new files/dirs would ignore the SGID bit on their
      parent directory and always be owned by the creating user's uid/gid.
      Signed-off-by: Chris Ball <cjb@laptop.org>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
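
For readers unfamiliar with the SGID rule, here is a small sketch of the conventional Linux behaviour the fix brings btrfs in line with: a new inode under a set-group-ID directory takes the directory's group, and a new subdirectory also inherits the bit. The struct, the function name `init_owner` and the `MODE_*` macros are hypothetical; the macro values mirror the usual S_IFMT, S_IFDIR and S_ISGID bits.

```c
#include <assert.h>

/* Illustrative mode bits with the conventional Linux values. */
#define MODE_IFMT  0170000u
#define MODE_IFDIR 0040000u
#define MODE_SGID  0002000u

/* Ownership fields of a newly created inode (illustrative struct). */
struct new_owner {
    unsigned uid;
    unsigned gid;
    unsigned mode;
};

static struct new_owner init_owner(unsigned dir_mode, unsigned dir_gid,
                                   unsigned fsuid, unsigned fsgid,
                                   unsigned mode)
{
    struct new_owner o = { fsuid, fsgid, mode };

    /* The parent must be a directory. */
    assert((dir_mode & MODE_IFMT) == MODE_IFDIR);

    if (dir_mode & MODE_SGID) {
        o.gid = dir_gid;                          /* inherit the parent's group */
        if ((mode & MODE_IFMT) == MODE_IFDIR)
            o.mode |= MODE_SGID;                  /* subdirectories propagate the bit */
    }
    return o;
}
```

Without the SGID check, every new inode would get the creator's uid and gid, which is the behaviour the commit message describes as the bug.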
    • Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks · bd56b302
      Chris Mason committed
      Every transaction in btrfs creates a new snapshot, and then schedules the
      snapshot from the last transaction for deletion.  Snapshot deletion
      works by walking down the btree and dropping the reference counts
      on each btree block during the walk.
      
      If a given leaf or node has a reference count greater than one,
      the reference count is decremented and the subtree pointed to by that
      node is ignored.
      
      If the reference count is one, walking continues down into that node
      or leaf, and the references of everything it points to are decremented.
      
      The old code would try to work in small pieces, walking down the tree
      until it found the lowest leaf or node to free and then returning.  This
      was very friendly to the rest of the FS because it didn't have a huge
      impact on other operations.
      
      But it wouldn't always keep up with the rate at which new commits added new
      snapshots for deletion, and it wasn't optimal for the extent
      allocation tree because it wasn't finding leaves that were close together
      on disk and processing them at the same time.
      
      This changes things to walk down to a level 1 node and then process it
      in bulk.  All the leaf pointers are sorted and the leaves are dropped
      in order based on their extent number.
      
      The extent allocation tree and commit code are now fast enough for
      this kind of bulk processing to work without slowing the rest of the FS
      down.  Overall it does less IO and is better able to keep up with
      snapshot deletions under high load.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
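
The reference-count rule for snapshot deletion described above can be sketched as a recursive walk. This toy model covers only that rule (decrement and skip shared subtrees, descend into unshared ones); it does not model the bulk, sorted processing of level 1 nodes that the commit introduces. The struct and the function name `drop_ref` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Toy btree block: a reference count plus child pointers. */
struct block {
    int refs;
    int freed;
    size_t nr_children;
    struct block *children[MAX_CHILDREN];
};

/* Drop one reference on b. Returns the number of blocks freed. */
static int drop_ref(struct block *b)
{
    int freed = 0;

    assert(b->refs > 0);
    if (b->refs > 1) {
        b->refs--;            /* shared: decrement and ignore the subtree */
        return 0;
    }
    /* Last reference: everything this block points to loses a reference. */
    for (size_t i = 0; i < b->nr_children; i++)
        freed += drop_ref(b->children[i]);
    b->refs = 0;
    b->freed = 1;
    return freed + 1;
}
```

A leaf that is still referenced by another snapshot survives the walk with its count reduced by one, while unshared blocks are freed.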
    • Btrfs: Change btree locking to use explicit blocking points · b4ce94de
      Chris Mason committed
      Most of the btrfs metadata operations can be protected by a spinlock,
      but some operations still need to schedule.
      
      So far, btrfs has been using a mutex along with a trylock loop.
      Most of the time it is able to avoid going for the full mutex, so
      the trylock loop is a big performance gain.
      
      This commit is step one for getting rid of the blocking locks entirely.
      btrfs_tree_lock takes a spinlock, and the code explicitly switches
      to a blocking lock when it starts an operation that can schedule.
      
      We'll be able to get rid of the blocking locks in smaller pieces over time.
      Tracing allows us to find the most common cause of blocking, so we
      can start with the hot spots first.
      
      The basic idea is:
      
      btrfs_tree_lock() returns with the spin lock held
      
      btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
      the extent buffer flags, and then drops the spin lock.  The buffer is
      still considered locked by all of the btrfs code.
      
      If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
      the spin lock and waits on a wait queue for the blocking bit to go away.
      
      Much of the code that needs to set the blocking bit finishes without actually
      blocking a good percentage of the time.  So, an adaptive spin is still
      used against the blocking bit to avoid very high context switch rates.
      
      btrfs_clear_lock_blocking() clears the blocking bit and returns
      with the spinlock held again.
      
      btrfs_tree_unlock() can be called on either blocking or spinning locks;
      it does the right thing based on the blocking bit.
      
      ctree.c has a helper function to set/clear all the locked buffers in a
      path as blocking.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
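
The state transitions listed above can be modelled in user space with C11 atomics. This is a rough sketch, not the btrfs implementation: it offers only a try-lock, so the wait queue and the adaptive spin are omitted, and all names (`tree_lock`, `tree_trylock`, and so on) are hypothetical analogues of the functions in the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct tree_lock {
    atomic_flag spin;      /* stands in for the spinlock */
    atomic_bool blocking;  /* stands in for the EXTENT_BUFFER_BLOCKING bit */
};

/* Try to take the lock in spinning mode. */
static bool tree_trylock(struct tree_lock *l)
{
    if (atomic_flag_test_and_set(&l->spin))
        return false;                  /* spinlock is busy */
    if (atomic_load(&l->blocking)) {
        atomic_flag_clear(&l->spin);   /* a blocking owner holds it: back off */
        return false;
    }
    return true;
}

/* Owner is about to schedule: mark the lock blocking, drop the spinlock. */
static void set_lock_blocking(struct tree_lock *l)
{
    atomic_store(&l->blocking, true);
    atomic_flag_clear(&l->spin);
}

/* Owner is done blocking: retake the spinlock, clear the bit. */
static void clear_lock_blocking(struct tree_lock *l)
{
    assert(atomic_load(&l->blocking));
    while (atomic_flag_test_and_set(&l->spin)) {
        /* spin until the spinlock is free */
    }
    atomic_store(&l->blocking, false);
}

/* Works for both modes, based on the blocking bit. */
static void tree_unlock(struct tree_lock *l)
{
    if (atomic_load(&l->blocking))
        atomic_store(&l->blocking, false);
    else
        atomic_flag_clear(&l->spin);
}
```

While the blocking bit is set the spinlock itself is free, yet other lockers still fail, which is what "still considered locked" means in the description above.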
    • Btrfs: selinux support · 0279b4cd
      Jim Owens committed
      Add call to LSM security initialization and save
      resulting security xattr for new inodes.
      
      Add xattr support to symlink inode ops.
      
      Set inode->i_op for existing special files.
      Signed-off-by: jim owens <jowens@hp.com>
  7. 29 Jan, 2009 1 commit
    • Btrfs: fix readdir on 32 bit machines · 89f135d8
      Chris Mason committed
      After btrfs_readdir has gone through all the directory items, it
      sets the directory f_pos to the largest possible int.  This way
      applications that mix readdir with creating new files don't
      end up in an endless loop finding the new directory items as they go.
      
      It was a workaround for a bug in git, but the assumption was that if git
      could make this looping mistake then it would be a common problem.

      The largest possible int chosen was INT_LIMIT(typeof(file->f_pos)),
      and it is possible for that to be a larger number than 32 bit glibc
      expects to come out of readdir.

      This patch switches that to INT_LIMIT(off_t), which should keep
      applications happy on 32 and 64 bit machines.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
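
The width problem can be shown with a small user-space sketch. `SIGNED_MAX_OF` is a hypothetical macro that computes the largest positive value of a signed type, which is what the commit message uses INT_LIMIT for; it is written with unsigned arithmetic here to stay within defined C behaviour. The sketch assumes a 64-bit f_pos and a 32-bit off_t on the affected machines.

```c
#include <assert.h>
#include <stdint.h>

/* Largest positive value of a signed integer type, computed without
 * shifting into the sign bit. Hypothetical helper, not the kernel macro. */
#define SIGNED_MAX_OF(type) \
    ((type)((UINTMAX_C(1) << (sizeof(type) * 8 - 1)) - 1))

static_assert(SIGNED_MAX_OF(int32_t) == INT32_MAX, "32-bit limit");
static_assert(SIGNED_MAX_OF(int64_t) == INT64_MAX, "64-bit limit");

/* End-of-directory marker sized for the caller's off_t width. */
static int64_t eof_marker(int off_t_is_32bit)
{
    return off_t_is_32bit ? SIGNED_MAX_OF(int32_t) : SIGNED_MAX_OF(int64_t);
}

/* Would a 32-bit off_t be able to represent this position? */
static int fits_in_32bit_off_t(int64_t pos)
{
    return pos <= INT32_MAX;
}
```

A marker derived from the 64-bit position type overflows what a 32-bit off_t can hold, while one derived from off_t itself always fits.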
  8. 22 Jan, 2009 2 commits
    • Btrfs: fiemap support · 1506fcc8
      Yehuda Sadeh committed
      Now that bmap support is gone, this is the only way to get extent
      mappings for userland.  These are still not valid for IO, but they
      can tell us if a file has holes or how much fragmentation there is.
      Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
    • Btrfs: stop providing a bmap operation to avoid swapfile corruptions · 35054394
      Chris Mason committed
      Swapfiles use bmap to build a list of extents belonging to the file,
      and they assume these extents won't change over the life of the file.
      They also use the resulting list to do IO directly to the block device.
      
      This causes problems for btrfs in a few ways:
      
      btrfs returns logical block numbers through bmap, and these are not suitable
      for IO.  They might translate to different devices, raid etc.
      
      COW means that file block mappings are going to change frequently.
      
      Using swapfiles on btrfs will lead to corruption, so we're avoiding the
      problem for now by dropping bmap support entirely.  A later commit
      will add fiemap support for people that really want to know how
      a file is laid out.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  9. 21 Jan, 2009 2 commits
  10. 07 Jan, 2009 3 commits
  11. 06 Jan, 2009 2 commits
    • Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation · 180591bc
      Yan Zheng committed
      Snapshot creation happens at a specific time during transaction commit.  We
      need to make sure the code called by snapshot creation doesn't wait
      for the running transaction to commit.
      
      This changes btrfs_delete_inode and finish_pending_snaps to use
      btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks.
      
      It would be better if btrfs_delete_inode didn't use the join, but the
      call path that triggers it is:
      
      btrfs_commit_transaction->create_pending_snapshots->
      create_pending_snapshot->btrfs_lookup_dentry->
      fixup_tree_root_location->btrfs_read_fs_root->
      btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput
      
      This will be fixed in a later patch by moving the orphan cleanup to the
      cleaner thread.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix checkpatch.pl warnings · d397712b
      Chris Mason committed
      There were many; most are fixed now.  struct-funcs.c generates some warnings,
      but these are bogus.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  12. 18 Dec, 2008 1 commit
    • Btrfs: shift all end_io work to thread pools · cad321ad
      Chris Mason committed
      bio_end_io for reads without checksumming on and btree writes were
      happening without using async thread pools.  This means the extent_io.c
      code had to use spin_lock_irq and friends on the rb tree locks for
      extent state.
      
      There were some irq safe vs unsafe lock inversions between the delalloc
      lock and the extent state locks.  This patch gets rid of them by moving
      all end_io code into the thread pools.

      To avoid contention and deadlocks between the data end_io processing and the
      metadata end_io processing, yet another thread pool is added to finish
      off metadata writes.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  13. 16 Dec, 2008 2 commits
  14. 12 Dec, 2008 2 commits
    • Btrfs: fix nodatasum handling in balancing code · 17d217fe
      Yan Zheng committed
      Checksums on data can be disabled by mount option, so it's
      possible some data extents don't have checksums or have
      invalid checksums. This causes trouble for data relocation.
      This patch contains the following changes to make data relocation
      work.

      1) make the nodatasum/nodatacow mount options only affect new
      files. Checksums and COW on data are only controlled by the
      inode flags.

      2) check for the existence of checksums in the nodatacow checker.
      If checksums exist, force COW of the data extent. This ensures that
      the checksum for a given block is either valid or does not exist.

      3) update the data relocation code to properly handle the case
      of missing checksums.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: fix leaking block group on balance · d2fb3437
      Yan Zheng committed
      The block group structs are referenced in many different
      places, and it's not safe to free while balancing.  So, those block
      group structs were simply leaked instead.
      
      This patch replaces the block group pointer in the inode with the starting byte
      offset of the block group and adds reference counting to the block group
      struct.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
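
The idea of replacing a stored pointer with a starting offset plus reference counting can be sketched as follows. This is a toy model only: the table lookup, the `live` flag and the function names are hypothetical and do not reflect how btrfs actually indexes block groups.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct block_group {
    uint64_t start;  /* starting byte offset, the value the inode remembers */
    int refs;        /* one reference is held by the lookup table itself */
    int live;        /* still reachable through the lookup table */
};

/* Find a block group by starting offset and take a reference on it. */
static struct block_group *bg_lookup(struct block_group *table, size_t n,
                                     uint64_t start)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].live && table[i].start == start) {
            table[i].refs++;
            return &table[i];
        }
    }
    return NULL;     /* the group is gone, e.g. removed by a balance */
}

/* Drop a reference. Returns 1 when the struct may now be freed. */
static int bg_put(struct block_group *bg)
{
    assert(bg->refs > 0);
    return --bg->refs == 0;
}

/* Unlink the group from the table and drop the table's reference. */
static int bg_remove(struct block_group *bg)
{
    bg->live = 0;
    return bg_put(bg);
}
```

A holder that only remembers the offset gets NULL after removal instead of a dangling pointer, and the struct is freed only when the last active user drops its reference.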
  15. 09 Dec, 2008 2 commits
    • Btrfs: Add inode sequence number for NFS and reserved space in a few structs · c3027eb5
      Chris Mason committed
      This adds a sequence number to the btrfs inode that is increased on
      every update.  NFS will be able to use that to detect when an inode has
      changed, without relying on inaccurate time fields.
      
      While we're here, this also:
      
      Puts reserved space into the super block and inode
      
      Adds a log root transid to the super so we can pick the newest super
      based on the fsync log as well as the main transaction ID.  For now
      the log root transid is always zero, but that'll get fixed.
      
      Adds a starting offset to the dev_item.  This will let us do better
      alignment calculations if we know the start of a partition on the disk.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: move data checksumming into a dedicated tree · d20f7043
      Chris Mason committed
      Btrfs stores checksums for each data block.  Until now, they have
      been stored in the subvolume trees, indexed by the inode that is
      referencing the data block.  This means that when we read the inode,
      we've probably read in at least some checksums as well.
      
      But, this has a few problems:
      
      * The checksums are indexed by logical offset in the file.  When
      compression is on, this means we have to do the expensive checksumming
      on the uncompressed data.  It would be faster if we could checksum
      the compressed data instead.
      
      * If we implement encryption, we'll be checksumming the plain text and
      storing that on disk.  This is significantly less secure.
      
      * For either compression or encryption, we have to get the plain text
      back before we can verify the checksum as correct.  This makes the raid
      layer balancing and extent moving much more expensive.
      
      * It makes the front end caching code more complex, as we have to touch
      the subvolume and inodes as we cache extents.

      * There is potentially one copy of the checksum in each subvolume
      referencing an extent.

      The solution used here is to store the extent checksums in a dedicated
      tree.  This allows us to index the checksums by physical extent
      start and length.  It means:
      
      * The checksum is against the data stored on disk, after any compression
      or encryption is done.
      
      * The checksum is stored in a central location, and can be verified without
      following back references, or reading inodes.
      
      This makes compression significantly faster by reducing the amount of
      data that needs to be checksummed.  It will also allow much faster
      raid management code in general.
      
      The checksums are indexed by a key with a fixed objectid (a magic value
      in ctree.h) and offset set to the starting byte of the extent.  This
      allows us to copy the checksum items into the fsync log tree directly (or
      any other tree), without having to invent a second format for them.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
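
The key layout described in the last paragraph can be sketched like this. It is an illustration under assumptions: the objectid value is a placeholder (the real magic value lives in ctree.h), the two-field key omits any other fields a real btrfs key carries, and `csum_index` with its fixed block size is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder for the fixed objectid; not the real constant from ctree.h. */
#define CSUM_OBJECTID_PLACEHOLDER UINT64_C(0xC5C5)

struct csum_key {
    uint64_t objectid;  /* the same fixed value for every checksum item */
    uint64_t offset;    /* starting byte of the extent on disk */
};

static struct csum_key csum_key_for_extent(uint64_t extent_start)
{
    struct csum_key k = { CSUM_OBJECTID_PLACEHOLDER, extent_start };
    return k;
}

/* Order keys by objectid, then offset, so items sort by disk location. */
static int csum_key_cmp(const struct csum_key *a, const struct csum_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Index of the checksum covering bytenr within the item that starts at key. */
static uint64_t csum_index(const struct csum_key *key, uint64_t bytenr,
                           uint64_t blocksize)
{
    assert(blocksize > 0 && bytenr >= key->offset);
    return (bytenr - key->offset) / blocksize;
}
```

Because the key depends only on the on-disk position of the data, the same item can be copied into another tree, such as the fsync log tree, without translation.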
  16. 02 Dec, 2008 3 commits
  17. 20 Nov, 2008 3 commits
    • Btrfs: compat code fixes · 4b4e25f2
      Chris Mason committed
      The btrfs git kernel tree is used to build a standalone tree for
      compiling against older kernels.  This commit makes the standalone tree
      work with 2.6.27.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Use current_fsuid/gid · 79683f2d
      Chris Mason committed
      This fixes compile problems with linux-next.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Avoid writeback stalls · d2c3f4f6
      Chris Mason committed
      While building large bios in writepages, btrfs may end up waiting
      for other page writeback to finish if WB_SYNC_ALL is used.
      
      While it is waiting, the bio it is building has a number of pages with the
      writeback bit set and they aren't getting to the disk any time soon.  This
      lowers the latencies of writeback in general by sending down the bio being
      built before waiting for other pages.
      
      The bio submission code tries to limit the total number of async bios in
      flight by waiting when we're over a certain number of async bios.  But,
      the waits are happening while writepages is building bios, and this can easily
      lead to stalls and other problems for people calling wait_on_page_writeback.
      
      The current fix is to let the congestion tests take care of waiting.
      
      sync() and others make sure to drain the current async requests to make
      sure that everything that was pending when the sync was started really gets
      to disk.  The code would drain pending requests both before and after
      submitting a new request.
      
      But, if one of the requests is waiting for page writeback to finish,
      the draining waits might block that page writeback.  This changes the
      draining code to only wait after submitting the bio being processed.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  18. 18 Nov, 2008 3 commits
    • Btrfs: Add backrefs and forward refs for subvols and snapshots · 0660b5af
      Chris Mason committed
      Subvols and snapshots can now be referenced from any point in the directory
      tree.  We need to maintain back refs for them so we can find lost
      subvols.
      
      Forward refs are added so that we know all of the subvols and
      snapshots referenced anywhere in the directory tree of a single subvol.  This
      can be used to do recursive snapshotting (though it isn't yet) and it is
      also used to detect and prevent directory loops when creating new snapshots.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Give each subvol and snapshot their own anonymous devid · 3394e160
      Chris Mason committed
      Each subvolume has its own private inode number space, and so we need
      to fill in different device numbers for each subvolume to avoid confusing
      applications.
      
      This commit puts a struct super_block into struct btrfs_root so it can
      call set_anon_super() and get a different device number generated for
      each root.
      
      btrfs_rename is changed to prevent renames across subvols.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Allow subvolumes and snapshots anywhere in the directory tree · 3de4586c
      Chris Mason committed
      Before, all snapshots and subvolumes lived in a single flat directory.  This
      was awkward and confusing because the single flat directory was only writable
      with the ioctls.
      
      This commit changes the ioctls to create subvols and snapshots at any
      point in the directory tree.  This requires making separate ioctls for
      snapshot and subvol creation instead of combining them into one.
      
      The subvol ioctl does:
      
      btrfsctl -S subvol_name parent_dir
      
      After the ioctl is done subvol_name lives inside parent_dir.
      
      The snapshot ioctl does:
      
      btrfsctl -s path_for_snapshot root_to_snapshot
      
      path_for_snapshot can be an absolute or relative path.  btrfsctl breaks it up
      into directory and basename components.
      
      root_to_snapshot can be any file or directory in the FS.  The snapshot
      is taken of the entire root where that file lives.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  19. 13 Nov, 2008 1 commit
    • Btrfs: mount ro and remount support · c146afad
      Yan Zheng committed
      This patch adds mount ro and remount support. The main
      changes in the patch are: adding btrfs_remount and related
      helper functions; splitting the transaction related code
      out of close_ctree into btrfs_commit_super; updating the
      allocator to properly handle read-only block groups.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  20. 11 Nov, 2008 1 commit