1. 15 2月, 2012 2 次提交
    • J
      Btrfs: avoid positive number with ERR_PTR · 8f24b496
      Jan Schmidt 提交于
      inode_ref_info() returns 1 when the element wasn't found and < 0 on error,
      just like btrfs_search_slot(). In iref_to_path() it's an error when the
      inode ref can't be found, thus we return ERR_PTR(ret) in that case. In order
      to avoid ERR_PTR(1), we now set ret to -ENOENT in that case.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      8f24b496
    • K
      btrfs: Sector Size check during Mount · 941b2ddf
      Keith Mannthey 提交于
      Gracefully fail when trying to mount a BTRFS file system that has a
      sectorsize smaller than PAGE_SIZE.
      
      On PPC it is possible to build a FS while using a 4k PAGE_SIZE kernel
      then boot into a 64K PAGE_SIZE kernel.  Presently open_ctree fails in an
      endless loop and hangs the machine in this situation.
      
      My debugging has show this Sector size < Page size to be a non trivial
      situation and a graceful exit from the situation would be nice for the
      time being.
      Signed-off-by: NKeith Mannthey <kmannth@us.ibm.com>
      941b2ddf
  2. 01 2月, 2012 1 次提交
  3. 27 1月, 2012 11 次提交
    • C
      Btrfs: fix reservations in btrfs_page_mkwrite · 9998eb70
      Chris Mason 提交于
      Josef fixed btrfs_page_mkwrite to properly release reserved
      extents if there was an error.  But if we fail to get a reservation
      and we fail to dirty the inode (for ENOSPC reasons), we'll end up
      trying to release a reservation we never had.
      
      This makes sure we only release if we were able to reserve.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9998eb70
    • J
      Btrfs: advance window_start if we're using a bitmap · 9b230628
      Josef Bacik 提交于
      If we span a long area in a bitmap we could end up taking a lot of time
      searching to the next free area if we're searching from the original
      window_start, so advance window_start in order to make sure we don't do any
      superficial searching.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9b230628
    • D
      btrfs: mask out gfp flags in releasepage · 0c4e538b
      David Sterba 提交于
      btree_releasepage is a callback and can be passed unknown gfp flags and then
      they may end up in kmem_cache_alloc called from alloc_extent_state, slab
      allocator will BUG_ON when there is HIGHMEM or DMA32 flag set.
      
      This may happen when btrfs is mounted from a loop device, which masks out
      __GFP_IO flag. The check in try_release_extent_state
      
      3399                 if ((mask & GFP_NOFS) == GFP_NOFS)
      3400                         mask = GFP_NOFS;
      
      will not work and passes unfiltered flags further resulting in crash at
      mm/slab.c:2963
      
       [<000000000024ae4c>] cache_alloc_refill+0x3b4/0x5c8
       [<000000000024c810>] kmem_cache_alloc+0x204/0x294
       [<00000000001fd3c2>] mempool_alloc+0x52/0x170
       [<000003c000ced0b0>] alloc_extent_state+0x40/0xd4 [btrfs]
       [<000003c000cee5ae>] __clear_extent_bit+0x38a/0x4cc [btrfs]
       [<000003c000cee78c>] try_release_extent_state+0x9c/0xd4 [btrfs]
       [<000003c000cc4c66>] btree_releasepage+0x7e/0xd0 [btrfs]
       [<0000000000210d84>] shrink_page_list+0x6a0/0x724
       [<0000000000211394>] shrink_inactive_list+0x230/0x578
       [<0000000000211bb8>] shrink_list+0x6c/0x120
       [<0000000000211e4e>] shrink_zone+0x1e2/0x228
       [<0000000000211f24>] shrink_zones+0x90/0x254
       [<0000000000213410>] do_try_to_free_pages+0xac/0x420
       [<0000000000213ae0>] try_to_free_pages+0x13c/0x1b0
       [<0000000000204e6c>] __alloc_pages_nodemask+0x5b4/0x9a8
       [<00000000001fb04a>] grab_cache_page_write_begin+0x7e/0xe8
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0c4e538b
    • M
      Btrfs: fix enospc error caused by wrong checks of the chunk · 9e622d6b
      Miao Xie 提交于
      When we did sysbench test for inline files, enospc error happened easily though
      there was lots of free disk space which could be allocated for new chunks.
      
      Reproduce steps:
       # mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
       # mount <test partition> /mnt
       # ulimit -n 102400
       # cd /mnt
       # sysbench --num-threads=1 --test=fileio --file-num=81920 \
       > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
       > --file-test-mode=seqwr prepare
       # sysbench --num-threads=1 --test=fileio --file-num=81920 \
       > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
       > --file-test-mode=seqwr run
       <soon later, BUG_ON() was triggered by enospc error>
      
      The reason of this bug is:
      Now, we can reserve space which is larger than the free space in the chunks if
      we have enough free disk space which can be used for new chunks. By this way,
      the space allocator should allocate a new chunk by force if there is no free
      space in the free space cache. But there are two wrong checks which break this
      operation.
      
      One is
      	if (ret == -ENOSPC && num_bytes > min_alloc_size)
      in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
      even we fail to allocate free space by minimum allocable size.
      
      The other is
      	if (space_info->force_alloc)
      		force = space_info->force_alloc;
      in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
      sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
      
      Fix these two wrong checks. Especially the second one, we fix it by changing
      the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
      CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
      higher priority. And if the value which is passed in by the caller is greater
      than ->force_alloc, use the passed value.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9e622d6b
    • L
      Btrfs: do not defrag a file partially · 7ec31b54
      Liu Bo 提交于
      xfstests 218 complains that btrfs defrags a file partially:
       After: 1
       Write backwards sync, but contiguous - should defrag to 1 extent
       Before: 10
      -After: 1
      +After: 2
      
      To fix this, we need to set max_to_defrag count properly.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7ec31b54
    • S
      Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c · 0b485143
      Stefan Behrens 提交于
      There have been 4 warnings on 32-bit build, they are herewith fixed.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0b485143
    • J
      Btrfs: use cluster->window_start when allocating from a cluster bitmap · 0b4a9d24
      Josef Bacik 提交于
      We specifically set window_start in the cluster struct to indicate where the
      cluster starts in a bitmap, but we've been using min_start to indicate where
      we're searching from.  This is usually the start of the blockgroup, so
      essentially means we're constantly searching from the start of any bitmap we
      find, which completely negates all the trouble we go to in order to setup a
      cluster.  So start using window_start to make sure we actually use the area we
      found.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0b4a9d24
    • M
      Btrfs: Check for NULL page in extent_range_uptodate · 8bedd51b
      Mitch Harder 提交于
      A user has encountered a NULL pointer kernel oops in btrfs when
      encountering media errors.  The problem has been identified
      as an unhandled NULL pointer returned from find_get_page().
      This modification simply checks for a NULL page, and returns
      with an error if found (the extent_range_uptodate() function
      returns 1 on errors).
      
      After testing this patch, the user reported that the error with
      the NULL pointer oops was solved.  However, there is still a
      remaining problem with a thread becoming stuck in
      wait_on_page_locked(page) in the read_extent_buffer_pages(...)
      function in extent_io.c
      
             for (i = start_i; i < num_pages; i++) {
                     page = extent_buffer_page(eb, i);
                     wait_on_page_locked(page);
                     if (!PageUptodate(page))
                             ret = -EIO;
             }
      
      This patch leaves the issue with the locked page yet to be resolved.
      Signed-off-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      8bedd51b
    • J
      btrfs: Fix busyloops in transaction waiting code · 6dd70ce4
      Jan Kara 提交于
      wait_log_commit() and wait_for_writer() were using slightly different
      conditions for deciding whether they should call schedule() and whether they
      should continue in the wait loop. Thus it could happen that we busylooped when
      the first condition was not true while the second one was. That is burning CPU
      cycles needlessly and is deadly on UP machines...
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      6dd70ce4
    • J
      Btrfs: make sure a bitmap has enough bytes · 357b9784
      Josef Bacik 提交于
      We have only been checking for min_bytes available in bitmap entries, but we
      won't successfully setup a bitmap cluster unless it has at least bytes in the
      bitmap, so in the common case min_bytes is 4k and we want something like 2MB, so
      if there are a bunch of bitmap entries with less than 2mb's in them, we'll
      search all them anyway, which is suboptimal.  Fix this check.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      357b9784
    • J
      Btrfs: fix uninit warning in backref.c · b1375d64
      Jan Schmidt 提交于
      Added initialization with the declaration of ret. It isn't set later on the
      switch-default branch (which should never be taken).
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b1375d64
  4. 17 1月, 2012 26 次提交
    • C
      Btrfs: use larger system chunks · 96bdc7dc
      Chris Mason 提交于
      system chunks by default are very small.  This makes them slightly
      larger and also fixes the conditional checks to make sure we don't
      allocate a billion of them at once.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      96bdc7dc
    • J
      Btrfs: add a delalloc mutex to inodes for delalloc reservations · f248679e
      Josef Bacik 提交于
      I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
      that and theres no real way to get rid of those, so just stop using i_mutex to
      protect delalloc metadata reservations and use a delalloc mutex instead.  This
      shouldn't be contended often at all, only if you are writing and mmap writing to
      the file at the same time.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      f248679e
    • J
      Btrfs: space leak tracepoints · 8c2a3ca2
      Josef Bacik 提交于
      This in addition to a script in my btrfs-tracing tree will help track down space
      leaks when we're getting space left over in block groups on umount.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      8c2a3ca2
    • J
      Btrfs: protect orphan block rsv with spin_lock · 90290e19
      Josef Bacik 提交于
      We've been seeing warnings coming out of the orphan commit stuff forever from
      ceph.  Turns out it's because we're racing with checking if the orphan block
      reserve is set, because we clear it outside of the spin_lock.  So leave the
      normal fastpath checks where they are, but take the spin_lock and _recheck_ to
      make sure we haven't had an orphan block rsv added in the meantime.  Then clear
      the root's orphan block rsv and release the lock.  With this patch a user said
      the warnings went away and they usually showed up pretty soon after he started
      ceph.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      90290e19
    • J
      Btrfs: add allocator tracepoints · 3f7de037
      Josef Bacik 提交于
      I used these tracepoints when figuring out what the cluster stuff was doing, so
      add them to mainline in case we need to profile this stuff again.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      3f7de037
    • J
      Btrfs: don't call btrfs_throttle in file write · 45a8090e
      Josef Bacik 提交于
      Btrfs_throttle will make us wait if there is a currently committing transaction
      until we can open new transactions, which is ridiculous since we don't actually
      start any transactions within the file write path anyway, so all this does is
      introduce big latencies if we have a sync/fsync heavy workload going on while
      somebody else is trying to do work.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      45a8090e
    • J
      Btrfs: release space on error in page_mkwrite · ec39e180
      Josef Bacik 提交于
      If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
      which is a problem since we make our reservation right before trying to update
      the inode, so fix the out label so that we actually free our reservation.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      ec39e180
    • M
      Btrfs: fix btrfsck error 400 when truncating a compressed · f70a9a6b
      Miao Xie 提交于
      Reproduce steps:
       # mkfs.btrfs /dev/sdb5
       # mount /dev/sdb5 -o compress=lzo /mnt
       # dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
       # sync
       # truncate -s 64K /mnt/tmpfile
       root 5 inode 257 errors 400
      
      This is because of the wrong if condition, which is used to check if we should
      subtract the bytes of the dropped range from i_blocks/i_bytes of i-node or not.
      When we truncate a compressed extent, btrfs substracts the bytes of the whole
      extent, it's wrong. We should substract the real size that we truncate, no
      matter it is a compressed extent or not. Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f70a9a6b
    • J
      Btrfs: do not use btrfs_end_transaction_throttle everywhere · 7ad85bb7
      Josef Bacik 提交于
      A user reported a problem where things like open with O_CREAT would take up to
      30 seconds when he had nfs activity on the same mount.  This is because all of
      our quick metadata operations, like create, symlink etc all do
      btrfs_end_transaction_throttle, which if the transaction is blocked will wait
      for the commit to complete before it returns.  This adds a ridiculous amount of
      latency and isn't really needed.  The normal btrfs_end_transaction will mark the
      transaction as blocked and wake the transaction kthread up if it thinks the
      transaction needs to end (this being in the running out of global reserve space
      scenario), and this is all that is really needed since we've already done
      everything we're going to do, we just need to return.  This should help people
      with the latency they were seeing when using synchronous heavy workloads.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7ad85bb7
    • I
      Btrfs: add balance progress reporting · 19a39dce
      Ilya Dryomov 提交于
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      19a39dce
    • I
      Btrfs: allow for resuming restriper after it was paused · de322263
      Ilya Dryomov 提交于
      Recognize BTRFS_BALANCE_RESUME flag passed from userspace.  We use the
      same heuristics used when recovering balance after a crash to try to
      start where we left off last time.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      de322263
    • I
      Btrfs: allow for canceling restriper · a7e99c69
      Ilya Dryomov 提交于
      Implement an ioctl for canceling restriper.  Currently we wait until
      relocation of the current block group is finished, in future this can be
      done by triggering a commit.  Balance item is deleted and no memory
      about the interrupted balance is kept.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      a7e99c69
    • I
      Btrfs: allow for pausing restriper · 837d5b6e
      Ilya Dryomov 提交于
      Implement an ioctl for pausing restriper.  This pauses the relocation,
      but balance is still considered to be "in progress": balance item is
      not deleted, other volume operations cannot be started, etc.  If paused
      in the middle of profile changing operation we will continue making
      allocations with the target profile.
      
      Add a hook to close_ctree() to pause restriper and free its data
      structures on unmount.  (It's safe to unmount when restriper is in
      "paused" state, we will resume with the same parameters on the next
      mount)
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      837d5b6e
    • I
      Btrfs: add skip_balance mount option · 9555c6c1
      Ilya Dryomov 提交于
      Since restriper kthread starts involuntarily on mount and can suck cpu
      and memory bandwidth add a mount option to forcefully skip it.  The
      restriper in that case hangs around in paused state and can be resumed
      from userspace when it's convenient.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      9555c6c1
    • I
      Btrfs: recover balance on mount · 59641015
      Ilya Dryomov 提交于
      On mount, if balance item is found, resume balance in a separate
      kernel thread.
      
      Try to be smart to continue roughly where previous balance (or convert)
      was interrupted.  For chunk types that were being converted to some
      profile we turn on soft convert, in case of a simple balance we turn on
      usage filter and relocate only less-than-90%-full chunks of that type.
      These are just heuristics but they help quite a bit, and can be improved
      in future.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      59641015
    • I
      Btrfs: save balance parameters to disk · 0940ebf6
      Ilya Dryomov 提交于
      Introduce a new btree objectid for storing balance item.  The reason is
      to be able to resume restriper after a crash with the same parameters.
      Balance item has a very high objectid and goes into tree of tree roots.
      
      The key for the new item is as follows:
      
      	[ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]
      
      Older kernels simply ignore it so it's safe to mount with an older
      kernel and then go back to the newer one.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      0940ebf6
    • I
      Btrfs: soft profile changing mode (aka soft convert) · cfa4c961
      Ilya Dryomov 提交于
      When doing convert from one profile to another if soft mode is on
      restriper won't touch chunks that already have the profile we are
      converting to.  This is useful if e.g. half of the FS was converted
      earlier.
      
      The soft mode switch is (like every other filter) per-type.  This means
      that we can convert for example meta chunks the "hard" way while
      converting data chunks selectively with soft switch.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      cfa4c961
    • I
      Btrfs: implement online profile changing · e4d8ec0f
      Ilya Dryomov 提交于
      Profile changing is done by launching a balance with
      BTRFS_BALANCE_CONVERT bits set and target fields of respective
      btrfs_balance_args structs initialized.  Profile reducing code in this
      case will pick restriper's target profile if it's available instead of
      doing a blind reduce.  If target profile is not yet available it goes
      back to a plain reduce.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      e4d8ec0f
    • I
      Btrfs: do not reduce profile in do_chunk_alloc() · 70922617
      Ilya Dryomov 提交于
      Every caller of do_chunk_alloc() feeds it the reduced allocation
      profile, so stop trying to reduce it one more time.  Instead check the
      validity of the passed profile.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      70922617
    • I
      Btrfs: virtual address space subset filter · ea67176a
      Ilya Dryomov 提交于
      Select chunks which have at least one byte located inside a given
      [vstart, vend) virtual address space range.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      ea67176a
    • I
      Btrfs: devid subset filter · 94e60d5a
      Ilya Dryomov 提交于
      Select chunks which have at least one byte of at least one stripe
      located on a device with devid X in a given [pstart,pend) physical
      address range.
      
      This filter only works when devid filter is turned on.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      94e60d5a
    • I
      Btrfs: devid filter · 409d404b
      Ilya Dryomov 提交于
      Relocate chunks which have at least one stripe located on a device with
      devid X.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      409d404b
    • I
      Btrfs: usage filter · 5ce5b3c0
      Ilya Dryomov 提交于
      Select chunks that are less than X percent full.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      5ce5b3c0
    • I
      Btrfs: profiles filter · ed25e9b2
      Ilya Dryomov 提交于
      Select chunks based on a given profile mask.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      ed25e9b2
    • I
      Btrfs: add basic infrastructure for selective balancing · f43ffb60
      Ilya Dryomov 提交于
      This allows to have a separate set of filters for each chunk type
      (data,meta,sys).  The code however is generic and switch on chunk type
      is only done once.
      
      This commit also adds a type filter: it allows to balance for example
      meta and system chunks w/o touching data ones.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      f43ffb60
    • I
      Btrfs: add basic restriper infrastructure · c9e9f97b
      Ilya Dryomov 提交于
      Add basic restriper infrastructure: extended balancing ioctl and all
      related ioctl data structures, add data structure for tracking
      restriper's state to fs_info, etc.  The semantics of the old balancing
      ioctl are fully preserved.
      
      Explicitly disallow any volume operations when balance is in progress.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      c9e9f97b