  1. 27 Apr, 2009 6 commits
  2. 25 Apr, 2009 6 commits
  3. 22 Apr, 2009 1 commit
    • Btrfs: fix btrfs fallocate oops and deadlock · 546888da
      Chris Mason authored
      Btrfs fallocate was incorrectly starting a transaction with a lock held
      on the extent_io tree for the file, which could deadlock.  Strictly
      speaking it was using join_transaction which would be safe, but it is better
      to move the transaction outside of the lock.
      
      When preallocated extents are overwritten, btrfs_mark_buffer_dirty was
      being called on an unlocked buffer.  This was triggering an assertion and
      oops because the lock is supposed to be held.
      
      The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had
      been run.  btrfs_del_item takes care of dirtying things, so the solution is
      to skip the btrfs_mark_buffer_dirty call in this case.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 21 Apr, 2009 4 commits
    • Btrfs: use the right node in reada_for_balance · 8c594ea8
      Chris Mason authored
      reada_for_balance was using the wrong index into the path node array,
      so it wasn't reading the right blocks.  We never directly used the
      results of the read done by this function because the btree search is
      started over at the end.
      
      This fixes reada_for_balance to reada in the correct node and to
      avoid searching past the last slot in the node.  It also makes sure to
      hold the parent lock while we are finding the nodes to read.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix oops on page->mapping->host during writepage · 11c8349b
      Chris Mason authored
      The extent_io writepage call updates the writeback index in the inode
      as it makes progress.  But, it was doing the update after unlocking the page,
      which isn't legal because page->mapping can't be trusted once the page
      is unlocked.
      
      This led to an oops, especially common with compression turned on.  The
      fix here is to update the writeback index before unlocking the page.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: add a priority queue to the async thread helpers · d313d7a3
      Chris Mason authored
      Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
      higher priority.  But, the checksumming helper threads prevent it
      from being fully effective.
      
      There are two problems.  First, a big queue of pending checksumming
      will delay the synchronous IO behind other lower priority writes.  Second,
      the checksumming uses an ordered async work queue.  The ordering makes sure
      that IOs are sent to the block layer in the same order they are sent
      to the checksumming threads.  Usually this gives us less seeky IO.
      
      But, when we start mixing IO priorities, the lower priority IO can delay
      the higher priority IO.
      
      This patch solves both problems by adding a high priority list to the async
      helper threads, and a new btrfs_set_work_high_prio(), which is used
      to put a new async work item onto the higher priority list.
      
      The ordering is still done on high priority IO, but all of the high
      priority bios are ordered separately from the low priority bios.  This
      ordering is purely an IO optimization; it is not involved in data
      or metadata integrity.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
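      The two-list scheme described in this commit can be sketched in userspace
      C as below.  This is only an illustration, not the btrfs async-thread
      code: every demo_* name is hypothetical, the queues are fixed-size arrays
      without wrap-around, and locking is left out.  It shows the one property
      the message describes: high priority items are handed out first, and each
      list keeps its own submission order.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_QUEUE_CAP 16

/* Hypothetical work item; only the priority flag matters here. */
struct demo_work {
    int id;
    int high_prio;
};

/* Toy FIFO: a fixed array that is filled once and drained once. */
struct demo_fifo {
    struct demo_work *items[DEMO_QUEUE_CAP];
    size_t head;    /* next slot to pop */
    size_t tail;    /* next slot to push */
};

/* One worker with a high priority list and a normal list. */
struct demo_worker {
    struct demo_fifo prio_pending;
    struct demo_fifo pending;
};

/* Mark a work item so that it is queued on the high priority list. */
static void demo_set_work_high_prio(struct demo_work *work)
{
    work->high_prio = 1;
}

static void demo_fifo_push(struct demo_fifo *q, struct demo_work *work)
{
    assert(q->tail < DEMO_QUEUE_CAP);
    q->items[q->tail++] = work;
}

static struct demo_work *demo_fifo_pop(struct demo_fifo *q)
{
    if (q->head == q->tail)
        return NULL;
    return q->items[q->head++];
}

/* Queue the item on the list that matches its priority. */
static void demo_queue_work(struct demo_worker *worker, struct demo_work *work)
{
    demo_fifo_push(work->high_prio ? &worker->prio_pending : &worker->pending,
                   work);
}

/* Drain the high priority list before touching the normal one. */
static struct demo_work *demo_next_work(struct demo_worker *worker)
{
    struct demo_work *work = demo_fifo_pop(&worker->prio_pending);

    return work ? work : demo_fifo_pop(&worker->pending);
}
```

      Queuing items 1, 2, 3, 4 with 2 and 4 marked high priority yields them in
      the order 2, 4, 1, 3: FIFO order is kept inside each class, but not
      across classes.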
    • Btrfs: use WRITE_SYNC for synchronous writes · ffbd517d
      Chris Mason authored
      Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
      writes we plan on waiting on in the near future.  This patch
      mirrors recent changes in other filesystems and the generic code to
      use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
      other latency critical writes.
      
      Btrfs uses async worker threads for checksumming before the write is done,
      and then again to actually submit the bios.  The bio submission code just
      runs a per-device list of bios that need to be sent down the pipe.
      
      This list is split into low priority and high priority lists so the
      WRITE_SYNC IO happens first.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  5. 03 Apr, 2009 20 commits
    • Btrfs: BUG to BUG_ON changes · c293498b
      Stoyan Gaydarov authored
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: remove dead code · 3e7ad38d
      Dan Carpenter authored
      Remove an unneeded return statement and conditional.
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: remove dead code · ff0a5836
      Dan Carpenter authored
      merge is always NULL at this point.
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix typos in comments · d4a78947
      Wu Fengguang authored
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: remove unused ftrace include · 2e966ed2
      Jim Owens authored
      Signed-off-by: jim owens <jowens@hp.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix __ucmpdi2 compile bug on 32 bit builds · 93dbfad7
      Heiko Carstens authored
      We get this on 32 bit builds:
      
      fs/built-in.o: In function `extent_fiemap':
      (.text+0x1019f2): undefined reference to `__ucmpdi2'
      
      Happens because of a switch statement with a 64 bit argument.
      Convert this to an if statement to fix this.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
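      The switch-to-if conversion described in this commit can be illustrated
      with the userspace sketch below.  It is not the extent_fiemap code: the
      DEMO_* sentinel values and the demo_classify() helper are made up for the
      example.  The point, as the commit message puts it, is that a switch on a
      64 bit value can make the compiler emit a call to the libgcc comparison
      helper __ucmpdi2 on some 32 bit targets, and the kernel does not link
      against libgcc; a chain of equality tests avoids that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64 bit sentinel values, chosen only for this example. */
#define DEMO_EXTENT_HOLE     ((uint64_t)-3)
#define DEMO_EXTENT_INLINE   ((uint64_t)-2)
#define DEMO_EXTENT_DELALLOC ((uint64_t)-1)

enum demo_kind {
    DEMO_KIND_HOLE,
    DEMO_KIND_INLINE,
    DEMO_KIND_DELALLOC,
    DEMO_KIND_MAPPED
};

/*
 * Classify a 64 bit block start with plain equality tests instead of
 * "switch (block_start) { case DEMO_EXTENT_HOLE: ... }".
 */
static enum demo_kind demo_classify(uint64_t block_start)
{
    if (block_start == DEMO_EXTENT_HOLE)
        return DEMO_KIND_HOLE;
    if (block_start == DEMO_EXTENT_INLINE)
        return DEMO_KIND_INLINE;
    if (block_start == DEMO_EXTENT_DELALLOC)
        return DEMO_KIND_DELALLOC;
    return DEMO_KIND_MAPPED;
}

/* Small self-check: a sentinel maps to its kind, other values are mapped. */
static void demo_classify_selfcheck(void)
{
    assert(demo_classify(DEMO_EXTENT_HOLE) == DEMO_KIND_HOLE);
    assert(demo_classify(4096) == DEMO_KIND_MAPPED);
}
```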
    • Btrfs: free inode struct when btrfs_new_inode fails · 09771430
      Shen Feng authored
      btrfs_new_inode doesn't call iput to free the inode
      when it fails.
      Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix race in worker_loop · b5555f77
      Amit Gud authored
      We need to check kthread_should_stop after schedule_timeout() and before
      calling schedule().  Without that check, threads can sleep with potentially
      no one to wake them up, causing mount(2) to hang in btrfs_stop_workers
      waiting for threads to stop.
      Signed-off-by: Amit Gud <gud@ksu.edu>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: add flushoncommit mount option · dccae999
      Sage Weil authored
      The 'flushoncommit' mount option forces any data dirtied by a write in a
      prior transaction to commit as part of the current commit.  This makes
      the committed state a fully consistent view of the file system from the
      application's perspective (i.e., it includes all completed file system
      operations).  This was previously the behavior only when a snapshot is
      created.
      
      This is used by Ceph to ensure that completed writes make it to the
      platter along with the metadata operations they are bound to (by
      BTRFS_IOC_TRANS_{START,END}).
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: notreelog mount option · 3a5e1404
      Sage Weil authored
      Add a 'notreelog' mount option to disable the tree log (used by fsync,
      O_SYNC writes).  This is much slower, but the tree logging produces
      inconsistent views into the FS for Ceph.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: introduce btrfs_show_options · a9572a15
      Eric Paris authored
      btrfs options can change at times other than mount, yet /proc/mounts shows the
      options string used when the fs was mounted (an example would be when btrfs
      determines that barriers aren't useful and turns them off.)  This patch
      instead outputs the actual options in use by btrfs.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
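      The idea of printing the options actually in effect, rather than the
      string given at mount time, can be sketched as follows.  This is a
      userspace illustration only: the DEMO_MOUNT_* bits and
      demo_show_options() are hypothetical, and the real callback writes to a
      seq_file instead of a character buffer.  Each active option is emitted
      with a leading comma so it can be appended to the generic options.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits standing in for the filesystem's live mount state. */
#define DEMO_MOUNT_NOBARRIER     (1u << 0)
#define DEMO_MOUNT_SSD           (1u << 1)
#define DEMO_MOUNT_FLUSHONCOMMIT (1u << 2)
#define DEMO_MOUNT_NOTREELOG     (1u << 3)

/*
 * Format the options that are set in 'flags' right now into 'buf'.
 * Returns the number of characters written, not counting the final NUL.
 */
static size_t demo_show_options(unsigned int flags, char *buf, size_t len)
{
    static const struct {
        unsigned int bit;
        const char *name;
    } opts[] = {
        { DEMO_MOUNT_NOBARRIER,     "nobarrier" },
        { DEMO_MOUNT_SSD,           "ssd" },
        { DEMO_MOUNT_FLUSHONCOMMIT, "flushoncommit" },
        { DEMO_MOUNT_NOTREELOG,     "notreelog" },
    };
    size_t used = 0;
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof(opts) / sizeof(opts[0]); i++) {
        int n;

        if (!(flags & opts[i].bit))
            continue;
        n = snprintf(buf + used, len - used, ",%s", opts[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* drop the partially written option */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

      If the flags change after mount, for instance when the barrier bit is set
      at runtime, the next call reflects the new state, which is the behaviour
      the commit message asks of /proc/mounts.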
    • Btrfs: rework allocation clustering · fa9c0d79
      Chris Mason authored
      Because btrfs is copy-on-write, we end up picking new locations for
      blocks very often.  This makes it fairly difficult to maintain perfect
      read patterns over time, but we can at least do some optimizations
      for writes.
      
      This is done today by remembering the last place we allocated and
      trying to find a free space hole big enough to hold more than just one
      allocation.  The end result is that we tend to write sequentially to
      the drive.
      
      This happens all the time for metadata and it happens for data
      when mounted -o ssd.  But, the way we record it is fairly racy
      and it tends to fragment the free space over time because we are trying
      to allocate fairly large areas at once.
      
      This commit gets rid of the races by adding a free space cluster object
      with dedicated locking to make sure that only one process at a time
      is out replacing the cluster.
      
      The free space fragmentation is somewhat solved by allowing a cluster
      to be comprised of smaller free space extents.  This part definitely
      adds some CPU time to the cluster allocations, but it allows the allocator
      to consume the small holes left behind by cow.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Optimize locking in btrfs_next_leaf() · 8e73f275
      Chris Mason authored
      btrfs_next_leaf was using blocking locks when it could have been using
      faster spinning ones instead.  This adds a few extra checks around
      the pieces that block and switches over to spinning locks.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: break up btrfs_search_slot into smaller pieces · c8c42864
      Chris Mason authored
      btrfs_search_slot was doing too many things at once.  This breaks
      it up into more reasonable units.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: kill the pinned_mutex · 04018de5
      Josef Bacik authored
      This patch removes the pinned_mutex.  The extent io map has an internal tree
      lock that protects the tree itself, and since we only copy the extent io map
      when we are committing the transaction we don't need it there.  We also don't
      need it when caching the block group since searching through the tree is also
      protected by the internal map spin lock.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: kill the block group alloc mutex · 6226cb0a
      Josef Bacik authored
      This patch removes the block group alloc mutex used to protect the free space
      tree for allocations and replaces it with a spin lock which is used only to
      protect the free space rb tree.  This means we only take the lock when we are
      directly manipulating the tree, which makes us a touch faster with
      multi-threaded workloads.
      
      This patch also gets rid of btrfs_find_free_space and replaces it with
      btrfs_find_space_for_alloc, which takes the number of bytes you want to
      allocate, and empty_size, which is used to indicate how much free space should
      be at the end of the allocation.
      
      It will return an offset for the allocator to use.  If we don't end up using it
      we _must_ call btrfs_add_free_space to put it back.  This is the tradeoff to
      kill the alloc_mutex, since we need to make sure nobody else comes along and
      takes our space.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: clean up find_free_extent · 2552d17e
      Josef Bacik authored
      I've replaced the strange looping constructs with a list_for_each_entry on
      space_info->block_groups.  If we have a hint we just jump into the loop with
      the block group and start looking for space.  If we don't find anything we
      start at the beginning and start looking.  We never come out of the loop with a
      ref on the block_group _unless_ we found space to use, then we drop it after we
      set the trans block_group.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
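      The search order described in this commit, starting at the hinted block
      group and falling back to the beginning of the list, can be sketched as
      below.  This is a simplified illustration: an array index replaces the
      list_for_each_entry walk over space_info->block_groups, reference
      counting is not modelled, and demo_find_free_extent() with its parameters
      is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block group: only its free byte count matters here. */
struct demo_block_group {
    uint64_t free_bytes;
};

/*
 * Return the index of the first group that can hold 'bytes', or -1.
 * A valid 'hint' is tried first, together with the groups after it;
 * if that finds nothing, the search restarts at the beginning.
 */
static int demo_find_free_extent(const struct demo_block_group *groups, int nr,
                                 int hint, uint64_t bytes)
{
    int start = (hint >= 0 && hint < nr) ? hint : 0;
    int pass;
    int i;

    assert(groups != NULL || nr == 0);
    for (pass = 0; pass < 2; pass++) {
        for (i = start; i < nr; i++) {
            if (groups[i].free_bytes >= bytes)
                return i;
        }
        if (start == 0)
            break;      /* the whole list has already been covered */
        start = 0;      /* nothing from the hint onward: start over */
    }
    return -1;
}
```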
    • Btrfs: free space cache cleanups · 70cb0743
      Josef Bacik authored
      This patch cleans up the free space cache code a bit.  It better documents the
      idiosyncrasies of tree_search_offset and makes the code make a bit more sense.
      I took out the info allocation at the start of __btrfs_add_free_space and put it
      where it makes more sense.  This was left over cruft from when alloc_mutex
      existed.  I also removed all of the re-searches we did to make sure we inserted properly.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: unplug in the async bio submission threads · bedf762b
      Chris Mason authored
      Btrfs pages being written get set to writeback, and then may go through
      a number of steps before they hit the block layer.  This includes compression,
      checksumming and async bio submission.
      
      The end result is that someone who writes a page and then does
      wait_on_page_writeback is likely to unplug the queue before the bio they
      cared about got there.
      
      We could fix this by marking bios sync, or by doing more frequent unplugs,
      but this commit just changes the async bio submission code to unplug
      after it has processed all the bios for a device.  The async bio submission
      does a fair job of collecting bios, so this shouldn't be a huge problem
      for reducing merging at the elevator.
      
      For streaming O_DIRECT writes on a 5 drive array, it boosts performance
      from 386MB/s to 460MB/s.
      
      Thanks to Hisashi Hifumi for helping with this work.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: keep processing bios for a given bdev if our proc is batching · b765ead5
      Chris Mason authored
      Btrfs uses async helper threads to submit write bios so the checksumming
      helper threads don't block on the disk.
      
      The submit bio threads may process bios for more than one block device,
      so when they find one device congested they try to move on to other
      devices instead of blocking in get_request_wait for one device.
      
      This does a pretty good job of keeping multiple devices busy, but the
      congested flag has a number of problems.  A congested device may still
      give you a request, and other procs that aren't backing off the congested
      device may starve you out.
      
      This commit uses the io_context stored in current to decide if our process
      has been made a batching process by the block layer.  If so, it keeps
      sending IO down for at least one batch.  This helps make sure we do
      a good amount of work each time we visit a bdev, and avoids large IO
      stalls in multi-device workloads.
      
      It's also very ugly.  A better solution is in the works with Jens Axboe.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 01 Apr, 2009 3 commits
    • fs: fix page_mkwrite error cases in core code and btrfs · 56a76f82
      Nick Piggin authored
      page_mkwrite is called with neither the page lock nor the ptl held.  This
      means a page can be concurrently truncated or invalidated out from
      underneath it.  Callers are supposed to prevent truncate races themselves,
      however previously the only thing they can do in case they hit one is to
      raise a SIGBUS.  A sigbus is wrong for the case that the page has been
      invalidated or truncated within i_size (eg.  hole punched).  Callers may
      also have to perform memory allocations in this path, where again, SIGBUS
      would be wrong.
      
      The previous patch ("mm: page_mkwrite change prototype to match fault")
      made it possible to properly specify errors.  Convert the generic buffer.c
      code and btrfs to return sane error values (in the case of page removed
      from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
      without doing anything, and the fault will be retried properly).
      
      This fixes core code, and converts btrfs as a template/example.  All other
      filesystems defining their own page_mkwrite should be fixed in a similar
      manner.
      Acked-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
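      The error mapping this commit argues for can be sketched in userspace C
      as follows.  Everything here is a stand-in: the DEMO_VM_FAULT_* values,
      the structures and demo_page_mkwrite() are hypothetical and do not match
      the kernel's definitions.  The sketch only shows the decision described
      above: a page that has left the page cache gets a "no page, retry the
      fault" result, an allocation failure gets an out-of-memory result, and
      SIGBUS is kept for other errors.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical result codes; the kernel's VM_FAULT_* values differ. */
#define DEMO_VM_FAULT_OOM    0x1
#define DEMO_VM_FAULT_SIGBUS 0x2
#define DEMO_VM_FAULT_NOPAGE 0x4

struct demo_mapping {
    unsigned long nrpages;
};

struct demo_page {
    struct demo_mapping *mapping;   /* NULL once the page has been removed */
};

/*
 * 'prepare_err' is the errno-style result of whatever preparation the
 * filesystem had to do for the write (0 on success).
 */
static int demo_page_mkwrite(const struct demo_page *page,
                             const struct demo_mapping *mapping,
                             int prepare_err)
{
    assert(page != NULL && mapping != NULL);

    /* Page truncated or invalidated underneath us: let the fault retry. */
    if (page->mapping != mapping)
        return DEMO_VM_FAULT_NOPAGE;
    /* A failed memory allocation is not a bus error. */
    if (prepare_err == -ENOMEM)
        return DEMO_VM_FAULT_OOM;
    /* Any other failure still ends in SIGBUS. */
    if (prepare_err)
        return DEMO_VM_FAULT_SIGBUS;
    return 0;
}
```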
    • mm: page_mkwrite change prototype to match fault · c2ec175c
      Nick Piggin authored
      Change the page_mkwrite prototype to take a struct vm_fault, and return
      VM_FAULT_xxx flags.  There should be no functional change.
      
      This makes it possible to return much more detailed error information to
      the VM (and also can provide more information eg.  virtual_address to the
      driver, which might be important in some special cases).
      
      This is required for a subsequent fix.  And will also make it easier to
      merge page_mkwrite() with fault() in future.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Cc: Felix Blyakher <felixb@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • New helper - current_umask() · ce3b0f8d
      Al Viro authored
      current->fs->umask is what most of fs_struct users are doing.
      Put that into a helper function.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
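      The helper introduced by this commit wraps the current->fs->umask access
      pattern.  A minimal userspace sketch of the same idea is shown below; the
      demo_* types and the demo_current pointer are stand-ins for the kernel's
      task structures and its current macro, not the real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel's fs_struct and task_struct. */
struct demo_fs_struct {
    int umask;
};

struct demo_task {
    struct demo_fs_struct *fs;
};

/* Plays the role of 'current' in this sketch. */
static struct demo_task *demo_current;

/* One place that knows how to reach the umask of the running task. */
static int demo_current_umask(void)
{
    assert(demo_current != NULL && demo_current->fs != NULL);
    return demo_current->fs->umask;
}

/* Typical caller: mask permission bits of a new file with the umask. */
static int demo_apply_umask(int mode)
{
    return mode & ~demo_current_umask();
}
```

      With a umask of 022, demo_apply_umask(0666) gives 0644.  Callers no
      longer need to know how the umask is stored, which is the point of
      putting the access into a helper.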