1. 15 6月, 2012 4 次提交
    • J
      Btrfs: use rcu to protect device->name · 606686ee
      Josef Bacik 提交于
      Al pointed out that we can just toss out the old name on a device and add a
      new one arbitrarily, so anybody who uses device->name in printk could
      possibly use free'd memory.  Instead of adding locking around all of this he
      suggested doing it with RCU, so I've introduced a struct rcu_string that
      does just that and have gone through and protected all accesses to
      device->name that aren't under the uuid_mutex with rcu_read_lock().  This
      protects us and I will use it for dealing with removing the device that we
      used to mount the file system in a later patch.  Thanks,
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      606686ee
    • J
      Btrfs: fix btrfs_destroy_marked_extents · ee670f0a
      Josef Bacik 提交于
      So we're forcing the eb's to have their ref count set to 1 so invalidatepage
      works but this breaks lots of things, for example root nodes, and is just
      plain wrong, we don't need to just evict all of this stuff.  Also drop the
      invalidatepage altogether and add a page_cache_release().  With this patch
      we no longer hang when trying to access the root nodes after an aborted
      transaction and we no longer leak memory.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      ee670f0a
    • J
      Btrfs: wake up transaction waiters when aborting a transaction · d7096fc3
      Josef Bacik 提交于
      I was getting lots of hung tasks and a NULL pointer dereference because we
      are not cleaning up the transaction properly when it aborts.  First we need
      to reset the running_transaction to NULL so we don't get a bad dereference
      for any start_transaction callers after this.  Also we cannot rely on
      waitqueue_active() since it's just a list_empty(), so just call wake_up()
      directly since that will do the barrier for us and such.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      d7096fc3
    • J
      Btrfs: fix locking in btrfs_destroy_delayed_refs · b939d1ab
      Josef Bacik 提交于
      The transaction abort stuff was throwing warnings from the list debugging
      code because we do a list_del_init outside of the delayed_refs spin lock.
      The delayed refs locking makes baby Jesus cry so it's not hard to get wrong,
      but we need to take the ref head mutex to make sure it's not being processed
      currently, and so if it is we need to drop the spin lock and then take and
      drop the mutex and do the search again.  If we can take the mutex then we
      can safely remove the head from the list and carry on.  Now when the
      transaction aborts I don't get the list debugging warnings.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      b939d1ab
  2. 30 5月, 2012 6 次提交
    • S
      Btrfs: read device stats on mount, write modified ones during commit · 733f4fbb
      Stefan Behrens 提交于
      The device statistics are written into the device tree with each
      transaction commit. Only modified statistics are written.
      When a filesystem is mounted, the device statistics for each involved
      device are read from the device tree and used to initialize the
      counters.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      733f4fbb
    • S
      Btrfs: add device counters for detected IO and checksum errors · 442a4f63
      Stefan Behrens 提交于
      The goal is to detect when drives start to get an increased error rate,
      when drives should be replaced soon. Therefore statistic counters are
      added that count IO errors (read, write and flush). Additionally, the
      software detected errors like checksum errors and corrupted blocks are
      counted.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      442a4f63
    • A
      btrfs: Drop unused function btrfs_abort_devices() · d07eb911
      Asias He 提交于
      1) This function is not used anywhere.
      
      2) Using the blk_abort_queue() to abort the queue seems not correct.
      blk_abort_queue() is used for timeout handling (block/blk-timeout.c).
      
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: linux-btrfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NAsias He <asias@redhat.com>
      d07eb911
    • J
      Btrfs: fix how we deal with the orphan block rsv · 8a35d95f
      Josef Bacik 提交于
      Ceph was hitting this race where we would remove an inode from the per-root
      orphan list before we would release the space we had reserved for the inode.
      We actually don't need a list or anything, we just need to make sure the
      root doesn't try to free up the orphan reserve until after the inodes have
      released their reservations.  So use an atomic counter instead of a list on
      the root and only decrement the counter after we've released our
      reservation.  I've tested this as well as several others and we no longer
      see the warnings that you would see while running ceph.  Thanks,
      Btrfs: fix how we deal with the orphan block rsv
      
      Ceph was hitting this race where we would remove an inode from the per-root
      orphan list before we would release the space we had reserved for the inode.
      We actually don't need a list or anything, we just need to make sure the
      root doesn't try to free up the orphan reserve until after the inodes have
      released their reservations.  So use an atomic counter instead of a list on
      the root and only decrement the counter after we've released our
      reservation.  I've tested this as well as several others and we no longer
      see the warnings that you would see while running ceph.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      8a35d95f
    • J
      Btrfs: convert the inode bit field to use the actual bit operations · 72ac3c0d
      Josef Bacik 提交于
      Miao pointed this out while I was working on an orphan problem that messing
      with a bitfield where different ranges are protected by different locks
      doesn't work out right.  Turns out we've been doing this forever where we
      have different parts of the bit field protected by either no lock at all or
      different locks which could cause all sorts of weird problems including the
      issue I was hitting.  So instead make a runtime_flags thing that we use the
      normal bit operations on that are all atomic so we can keep having our
      no/different locking for the different flags and then make force_compress
      it's own thing so it can be treated normally.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      72ac3c0d
    • J
      Btrfs: finish ordered extents in their own thread · 5fd02043
      Josef Bacik 提交于
      We noticed that the ordered extent completion doesn't really rely on having
      a page and that it could be done independantly of ending the writeback on a
      page.  This patch makes us not do the threaded endio stuff for normal
      buffered writes and direct writes so we can end page writeback as soon as
      possible (in irq context) and only start threads to do the ordered work when
      it is actually done.  Compression needs to be reworked some to take
      advantage of this as well, but atm it has to do a find_get_page in its endio
      handler so it must be done in its own thread.  This makes direct writes
      quite a bit faster.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      5fd02043
  3. 26 5月, 2012 2 次提交
  4. 06 5月, 2012 1 次提交
    • C
      Btrfs: avoid sleeping in verify_parent_transid while atomic · b9fab919
      Chris Mason 提交于
      verify_parent_transid needs to lock the extent range to make
      sure no IO is underway, and so it can safely clear the
      uptodate bits if our checks fail.
      
      But, a few callers are using it with spinlocks held.  Most
      of the time, the generation numbers are going to match, and
      we don't want to switch to a blocking lock just for the error
      case.  This adds an atomic flag to verify_parent_transid,
      and changes it to return EAGAIN if it needs to block to
      properly verifiy things.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9fab919
  5. 19 4月, 2012 2 次提交
  6. 30 3月, 2012 1 次提交
  7. 29 3月, 2012 3 次提交
  8. 27 3月, 2012 6 次提交
    • J
      Btrfs: deal with read errors on extent buffers differently · ea466794
      Josef Bacik 提交于
      Since we need to read and write extent buffers in their entirety we can't use
      the normal bio_readpage_error stuff since it only works on a per page basis.  So
      instead make it so that if we see an io error in endio we just mark the eb as
      having an IO error and then in btree_read_extent_buffer_pages we will manually
      try other mirrors and then overwrite the bad mirror if we find a good copy.
      This works with larger than page size blocks.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      ea466794
    • C
      Btrfs: don't use threaded IO completion helpers for metadata writes · f3f266ab
      Chris Mason 提交于
      The metadata write IO completion code is now simple enough that we
      don't need the threaded helpers anymore.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f3f266ab
    • J
      Btrfs: ensure an entire eb is written at once · 0b32f4bb
      Josef Bacik 提交于
      This patch simplifies how we track our extent buffers.  Previously we could exit
      writepages with only having written half of an extent buffer, which meant we had
      to track the state of the pages and the state of the extent buffers differently.
      Now we only read in entire extent buffers and write out entire extent buffers,
      this allows us to simply set bits in our bflags to indicate the state of the eb
      and we no longer have to do things like track uptodate with our iotree.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0b32f4bb
    • J
      Btrfs: introduce free_extent_buffer_stale · 3083ee2e
      Josef Bacik 提交于
      Because btrfs cow's we can end up with extent buffers that are no longer
      necessary just sitting around in memory.  So instead of evicting these pages, we
      could end up evicting things we actually care about.  Thus we have
      free_extent_buffer_stale for use when we are freeing tree blocks.  This will
      make it so that the ref for the eb being in the radix tree is dropped as soon as
      possible and then is freed when the refcount hits 0 instead of waiting to be
      released by releasepage.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      3083ee2e
    • J
      Btrfs: set page->private to the eb · 4f2de97a
      Josef Bacik 提交于
      We spend a lot of time looking up extent buffers from pages when we could just
      store the pointer to the eb the page is associated with in page->private.  This
      patch does just that, and it makes things a little simpler and reduces a bit of
      CPU overhead involved with doing metadata IO.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4f2de97a
    • C
      Btrfs: allow metadata blocks larger than the page size · 727011e0
      Chris Mason 提交于
      A few years ago the btrfs code to support blocks lager than
      the page size was disabled to fix a few corner cases in the
      page cache handling.  This fixes the code to properly support
      large metadata blocks again.
      
      Since current kernels will crash early and often with larger
      metadata blocks, this adds an incompat bit so that older kernels
      can't mount it.
      
      This also does away with different blocksizes for nodes and leaves.
      You get a single block size for all tree blocks.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      727011e0
  9. 22 3月, 2012 8 次提交
  10. 23 2月, 2012 1 次提交
    • C
      Btrfs: make sure we update latest_bdev · a6b0d5c8
      Chris Mason 提交于
      When we are setting up the mount, we close all the
      devices that were not actually part of the metadata we found.
      
      But, we don't make sure that one of those devices wasn't
      fs_devices->latest_bdev, which means we can do a use after free
      on the one we closed.
      
      This updates latest_bdev as it goes.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a6b0d5c8
  11. 15 2月, 2012 1 次提交
    • K
      btrfs: Sector Size check during Mount · 941b2ddf
      Keith Mannthey 提交于
      Gracefully fail when trying to mount a BTRFS file system that has a
      sectorsize smaller than PAGE_SIZE.
      
      On PPC it is possible to build a FS while using a 4k PAGE_SIZE kernel
      then boot into a 64K PAGE_SIZE kernel.  Presently open_ctree fails in an
      endless loop and hangs the machine in this situation.
      
      My debugging has show this Sector size < Page size to be a non trivial
      situation and a graceful exit from the situation would be nice for the
      time being.
      Signed-off-by: NKeith Mannthey <kmannth@us.ibm.com>
      941b2ddf
  12. 27 1月, 2012 1 次提交
    • D
      btrfs: mask out gfp flags in releasepage · 0c4e538b
      David Sterba 提交于
      btree_releasepage is a callback and can be passed unknown gfp flags and then
      they may end up in kmem_cache_alloc called from alloc_extent_state, slab
      allocator will BUG_ON when there is HIGHMEM or DMA32 flag set.
      
      This may happen when btrfs is mounted from a loop device, which masks out
      __GFP_IO flag. The check in try_release_extent_state
      
      3399                 if ((mask & GFP_NOFS) == GFP_NOFS)
      3400                         mask = GFP_NOFS;
      
      will not work and passes unfiltered flags further resulting in crash at
      mm/slab.c:2963
      
       [<000000000024ae4c>] cache_alloc_refill+0x3b4/0x5c8
       [<000000000024c810>] kmem_cache_alloc+0x204/0x294
       [<00000000001fd3c2>] mempool_alloc+0x52/0x170
       [<000003c000ced0b0>] alloc_extent_state+0x40/0xd4 [btrfs]
       [<000003c000cee5ae>] __clear_extent_bit+0x38a/0x4cc [btrfs]
       [<000003c000cee78c>] try_release_extent_state+0x9c/0xd4 [btrfs]
       [<000003c000cc4c66>] btree_releasepage+0x7e/0xd0 [btrfs]
       [<0000000000210d84>] shrink_page_list+0x6a0/0x724
       [<0000000000211394>] shrink_inactive_list+0x230/0x578
       [<0000000000211bb8>] shrink_list+0x6c/0x120
       [<0000000000211e4e>] shrink_zone+0x1e2/0x228
       [<0000000000211f24>] shrink_zones+0x90/0x254
       [<0000000000213410>] do_try_to_free_pages+0xac/0x420
       [<0000000000213ae0>] try_to_free_pages+0x13c/0x1b0
       [<0000000000204e6c>] __alloc_pages_nodemask+0x5b4/0x9a8
       [<00000000001fb04a>] grab_cache_page_write_begin+0x7e/0xe8
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0c4e538b
  13. 17 1月, 2012 4 次提交
    • I
      Btrfs: allow for canceling restriper · a7e99c69
      Ilya Dryomov 提交于
      Implement an ioctl for canceling restriper.  Currently we wait until
      relocation of the current block group is finished, in future this can be
      done by triggering a commit.  Balance item is deleted and no memory
      about the interrupted balance is kept.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      a7e99c69
    • I
      Btrfs: allow for pausing restriper · 837d5b6e
      Ilya Dryomov 提交于
      Implement an ioctl for pausing restriper.  This pauses the relocation,
      but balance is still considered to be "in progress": balance item is
      not deleted, other volume operations cannot be started, etc.  If paused
      in the middle of profile changing operation we will continue making
      allocations with the target profile.
      
      Add a hook to close_ctree() to pause restriper and free its data
      structures on unmount.  (It's safe to unmount when restriper is in
      "paused" state, we will resume with the same parameters on the next
      mount)
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      837d5b6e
    • I
      Btrfs: recover balance on mount · 59641015
      Ilya Dryomov 提交于
      On mount, if balance item is found, resume balance in a separate
      kernel thread.
      
      Try to be smart to continue roughly where previous balance (or convert)
      was interrupted.  For chunk types that were being converted to some
      profile we turn on soft convert, in case of a simple balance we turn on
      usage filter and relocate only less-than-90%-full chunks of that type.
      These are just heuristics but they help quite a bit, and can be improved
      in future.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      59641015
    • I
      Btrfs: add basic restriper infrastructure · c9e9f97b
      Ilya Dryomov 提交于
      Add basic restriper infrastructure: extended balancing ioctl and all
      related ioctl data structures, add data structure for tracking
      restriper's state to fs_info, etc.  The semantics of the old balancing
      ioctl are fully preserved.
      
      Explicitly disallow any volume operations when balance is in progress.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      c9e9f97b