1. 27 1月, 2012 1 次提交
    • D
      btrfs: mask out gfp flags in releasepage · 0c4e538b
      David Sterba 提交于
      btree_releasepage is a callback and can be passed unknown gfp flags and then
      they may end up in kmem_cache_alloc called from alloc_extent_state, slab
      allocator will BUG_ON when there is HIGHMEM or DMA32 flag set.
      
      This may happen when btrfs is mounted from a loop device, which masks out
      __GFP_IO flag. The check in try_release_extent_state
      
      3399                 if ((mask & GFP_NOFS) == GFP_NOFS)
      3400                         mask = GFP_NOFS;
      
      will not work and passes unfiltered flags further resulting in crash at
      mm/slab.c:2963
      
       [<000000000024ae4c>] cache_alloc_refill+0x3b4/0x5c8
       [<000000000024c810>] kmem_cache_alloc+0x204/0x294
       [<00000000001fd3c2>] mempool_alloc+0x52/0x170
       [<000003c000ced0b0>] alloc_extent_state+0x40/0xd4 [btrfs]
       [<000003c000cee5ae>] __clear_extent_bit+0x38a/0x4cc [btrfs]
       [<000003c000cee78c>] try_release_extent_state+0x9c/0xd4 [btrfs]
       [<000003c000cc4c66>] btree_releasepage+0x7e/0xd0 [btrfs]
       [<0000000000210d84>] shrink_page_list+0x6a0/0x724
       [<0000000000211394>] shrink_inactive_list+0x230/0x578
       [<0000000000211bb8>] shrink_list+0x6c/0x120
       [<0000000000211e4e>] shrink_zone+0x1e2/0x228
       [<0000000000211f24>] shrink_zones+0x90/0x254
       [<0000000000213410>] do_try_to_free_pages+0xac/0x420
       [<0000000000213ae0>] try_to_free_pages+0x13c/0x1b0
       [<0000000000204e6c>] __alloc_pages_nodemask+0x5b4/0x9a8
       [<00000000001fb04a>] grab_cache_page_write_begin+0x7e/0xe8
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0c4e538b
  2. 17 1月, 2012 5 次提交
    • I
      Btrfs: allow for canceling restriper · a7e99c69
      Ilya Dryomov 提交于
      Implement an ioctl for canceling restriper.  Currently we wait until
      relocation of the current block group is finished, in future this can be
      done by triggering a commit.  Balance item is deleted and no memory
      about the interrupted balance is kept.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      a7e99c69
    • I
      Btrfs: allow for pausing restriper · 837d5b6e
      Ilya Dryomov 提交于
      Implement an ioctl for pausing restriper.  This pauses the relocation,
      but balance is still considered to be "in progress": balance item is
      not deleted, other volume operations cannot be started, etc.  If paused
      in the middle of profile changing operation we will continue making
      allocations with the target profile.
      
      Add a hook to close_ctree() to pause restriper and free its data
      structures on unmount.  (It's safe to unmount when restriper is in
      "paused" state, we will resume with the same parameters on the next
      mount)
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      837d5b6e
    • I
      Btrfs: recover balance on mount · 59641015
      Ilya Dryomov 提交于
      On mount, if balance item is found, resume balance in a separate
      kernel thread.
      
      Try to be smart to continue roughly where previous balance (or convert)
      was interrupted.  For chunk types that were being converted to some
      profile we turn on soft convert, in case of a simple balance we turn on
      usage filter and relocate only less-than-90%-full chunks of that type.
      These are just heuristics but they help quite a bit, and can be improved
      in future.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      59641015
    • I
      Btrfs: add basic restriper infrastructure · c9e9f97b
      Ilya Dryomov 提交于
      Add basic restriper infrastructure: extended balancing ioctl and all
      related ioctl data structures, add data structure for tracking
      restriper's state to fs_info, etc.  The semantics of the old balancing
      ioctl are fully preserved.
      
      Explicitly disallow any volume operations when balance is in progress.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      c9e9f97b
    • I
      Btrfs: get rid of *_alloc_profile fields · 6fef8df1
      Ilya Dryomov 提交于
      {data,metadata,system}_alloc_profile fields have been unused for a long
      time now.  Get rid of them.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      6fef8df1
  3. 13 1月, 2012 2 次提交
    • M
      mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Mel Gorman 提交于
      This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
      mode that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6bc32b8
    • M
      mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage · b969c4ab
      Mel Gorman 提交于
      Asynchronous compaction is used when allocating transparent hugepages to
      avoid blocking for long periods of time.  Due to reports of stalling,
      there was a debate on disabling synchronous compaction but this severely
      impacted allocation success rates.  Part of the reason was that many dirty
      pages are skipped in asynchronous compaction by the following check;
      
      	if (PageDirty(page) && !sync &&
      		mapping->a_ops->migratepage != migrate_page)
      			rc = -EBUSY;
      
      This skips over all mapping aops using buffer_migrate_page() even though
      it is possible to migrate some of these pages without blocking.  This
      patch updates the ->migratepage callback with a "sync" parameter.  It is
      the responsibility of the callback to fail gracefully if migration would
      block.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b969c4ab
  4. 11 1月, 2012 1 次提交
    • L
      Btrfs: fix possible deadlock when opening a seed device · b367e47f
      Li Zefan 提交于
      The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex,
      but when we mount a filesystem which has backing seed devices, we have
      this lock chain:
      
          open_ctree()
              lock(chunk_mutex);
              read_chunk_tree();
                  read_one_dev();
                      open_seed_devices();
                          lock(uuid_mutex);
      
      and then we hit a lockdep splat.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      b367e47f
  5. 09 1月, 2012 10 次提交
  6. 22 12月, 2011 2 次提交
    • A
      Btrfs: mark delayed refs as for cow · 66d7e7f0
      Arne Jansen 提交于
      Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
      from every call site. The for_cow parameter will later on be used to
      determine if a ref will change anything with respect to qgroups.
      
      Delayed refs coming from relocation are always counted as for_cow, as they
      don't change subvol quota.
      
      Also pass in the fs_info for later use.
      
      btrfs_find_all_roots() will use this as an optimization, as changes that are
      for_cow will not change anything with respect to which root points to a
      certain leaf. Thus, we don't need to add the current sequence number to
      those delayed refs.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      66d7e7f0
    • S
      Btrfs: integrate integrity check module into btrfs · 21adbd5c
      Stefan Behrens 提交于
      This is the last part of the patch series. It modifies the btrfs
      code to use the integrity check module if configured to do so
      with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
      the only effective change is that code is added that handles the
      mount option to activate the integrity check. If the mount option is
      set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
      complains in the log and the mount fails with EINVAL.
      
      Add the mount option to activate the usage of the integrity check
      code.
      Add invocation of btrfs integrity check code init and cleanup
      function on mount and umount, respectively.
      Add hook to call btrfs integrity check code version of
      submit_bh/submit_bio.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      21adbd5c
  7. 16 12月, 2011 1 次提交
    • J
      Btrfs: fix num_workers_starting bug and other bugs in async thread · 0dc3b84a
      Josef Bacik 提交于
      Al pointed out we have some random problems with the way we account for
      num_workers_starting in the async thread stuff.  First of all we need to make
      sure to decrement num_workers_starting if we fail to start the worker, so make
      __btrfs_start_workers do this.  Also fix __btrfs_start_workers so that it
      doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we
      failed to create a worker.  Also check_pending_worker_creates needs to call
      __btrfs_start_work in it's work function since it already increments
      num_workers_starting.
      
      People only start one worker at a time, so get rid of the num_workers argument
      everywhere, and make btrfs_queue_worker a void since it will always succeed.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0dc3b84a
  8. 22 11月, 2011 1 次提交
    • T
      freezer: unexport refrigerator() and update try_to_freeze() slightly · a0acae0e
      Tejun Heo 提交于
      There is no reason to export two functions for entering the
      refrigerator.  Calling refrigerator() instead of try_to_freeze()
      doesn't save anything noticeable or removes any race condition.
      
      * Rename refrigerator() to __refrigerator() and make it return bool
        indicating whether it scheduled out for freezing.
      
      * Update try_to_freeze() to return bool and relay the return value of
        __refrigerator() if freezing().
      
      * Convert all refrigerator() users to try_to_freeze().
      
      * Update documentation accordingly.
      
      * While at it, add might_sleep() to try_to_freeze().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Samuel Ortiz <samuel@sortiz.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Christoph Hellwig <hch@infradead.org>
      a0acae0e
  9. 20 11月, 2011 2 次提交
    • J
      btrfs: mirror_num should be int, not u64 · 32240a91
      Jan Schmidt 提交于
      My previous patch introduced some u64 for failed_mirror variables, this one
      makes it consistent again.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      32240a91
    • C
      Btrfs: fix barrier flushes · 387125fc
      Chris Mason 提交于
      When btrfs is writing the super blocks, it send barrier flushes to make
      sure writeback caching drives get all the metadata on disk in the
      right order.
      
      But, we have two bugs in the way these are sent down.  When doing
      full commits (not via the tree log), we are sending the barrier down
      before the last super when it should be going down before the first.
      
      In multi-device setups, we should be waiting for the barriers to
      complete on all devices before writing any of the supers.
      
      Both of these bugs can cause corruptions on power failures.  We fix it
      with some new code to send down empty barriers to all devices before
      writing the first super.
      
      Alexandre Oliva found the multi-device bug.  Arne Jansen did the async
      barrier loop.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Reported-by: NAlexandre Oliva <oliva@lsd.ic.unicamp.br>
      387125fc
  10. 10 11月, 2011 2 次提交
  11. 07 11月, 2011 1 次提交
  12. 06 11月, 2011 6 次提交
  13. 02 11月, 2011 1 次提交
  14. 20 10月, 2011 3 次提交
    • J
      Btrfs: allow us to overcommit our enospc reservations · 2bf64758
      Josef Bacik 提交于
      One of the things that kills us is the fact that our ENOSPC reservations are
      horribly over the top in most normal cases.  There isn't too much that can be
      done about this because when we are completely full we really need them to work
      like this so we don't under reserve.  However if there is plenty of unallocated
      chunks on the disk we can use that to gauge how much we can overcommit.  So this
      patch adds chunk free space accounting so we always know how much unallocated
      space we have.  Then if we fail to make a reservation within our allocated
      space, check to see if we can overcommit.  In the normal flushing case (like
      with delalloc metadata reservations) we'll take the free space and divide it by
      2 if our metadata profile is setup for DUP or any of those, and then divide it
      by 8 to make sure we don't overcommit too much.  Then if we're in a non-flushing
      case (we really need this reservation now!) we only limit ourselves to half of
      the free space.  This makes this fio test
      
      [torrent]
      filename=torrent-test
      rw=randwrite
      size=4g
      ioengine=sync
      directory=/mnt/btrfs-test
      
      go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
      file system.  This doesn't seem to break my other enospc tests, but could really
      use some more testing as this is a super scary change.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      2bf64758
    • J
      Btrfs: put the block group cache after we commit the super · 300e4f8a
      Josef Bacik 提交于
      In moving some enospc stuff around I noticed that when we unmount we are often
      evicting the free space cache inodes before we do our last commit.  This isn't
      bad, but it makes us constantly have to re-read the inodes back.  So instead
      don't evict the cache until after we do our last commit, this will make things a
      little less crappy and makes a future enospc change work properly.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      300e4f8a
    • J
      Btrfs: kill the durable block rsv stuff · 37be25bc
      Josef Bacik 提交于
      This is confusing code and isn't used by anything anymore, so delete it.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      37be25bc
  15. 02 10月, 2011 2 次提交
    • A
      btrfs: hooks for readahead · 4bb31e92
      Arne Jansen 提交于
      This adds the hooks needed for readahead. In the readpage_end_io_hook,
      the extent state is checked for the EXTENT_READAHEAD flag. Only in this
      case the readahead hook is called, to keep the impact on non-ra as low
      as possible.
      Additionally, a hook for a failed IO is added, otherwise readahead would
      wait indefinitely for the extent to finish.
      
      Changes for v2:
       - eliminate race condition
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      4bb31e92
    • A
      btrfs: state information for readahead · 90519d66
      Arne Jansen 提交于
      Add state information for readahead to btrfs_fs_info and btrfs_device
      
      Changes v2:
       - don't wait in radix_trees
       - add own set of workers for readahead
      Reviewed-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      90519d66