1. 12 3月, 2011 1 次提交
    • C
      Btrfs: break out of shrink_delalloc earlier · 36e39c40
      Chris Mason 提交于
      Josef had changed shrink_delalloc to exit after three shrink
      attempts, which wasn't quite enough because new writers could
      race in and steal free space.
      
      But it also fixed deadlocks and stalls as we tried to recover
      delalloc reservations.  The code was tweaked to loop 1024
      times, and would reset the counter any time a small amount
      of progress was made.  This was too drastic, and with a
      lot of writers we can end up stuck in shrink_delalloc forever.
      
      The shrink_delalloc loop is fairly complex because the caller is looping
      too, and the caller will go ahead and force a transaction commit to make
      sure we reclaim space.
      
      This reworks things to exit shrink_delalloc when we've forced some
      writeback and the delalloc reservations have gone down.  This means
      the writeback has not just started but has also finished at
      least some of the metadata changes required to reclaim delalloc
      space.
      
      If we've got this wrong, we're returning ENOSPC too early, which
      is a big improvement over the current behavior of hanging the machine.
      
      Test 224 in xfstests hammers on this nicely, and with 1000 writers
      trying to fill a 1GB drive we get our first ENOSPC at 93% full.  The
      other writers are able to continue until we get 100%.
      
      This is a worst case test for btrfs because the 1000 writers are doing
      small IO, and the small FS size means we don't have a lot of room
      for metadata chunks.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      36e39c40
  2. 17 2月, 2011 2 次提交
    • C
      Btrfs: allow balance to explicitly allocate chunks as it relocates · c87f08ca
      Chris Mason 提交于
      Btrfs device shrinking and balancing ends up reallocating all the blocks
      in order to allow COW to move them to new destinations.  It is somewhat
      awkward in terms of ENOSPC because most of the enospc code is built
      around the idea that some operation on a reference counted tree triggers
      allocations in the non-reference counted trees.
      
      This commit changes the balancing code to deal with enospc by trying to
      allocate a new chunk.  If that allocation succeeds, we go ahead and
      retry whatever failed due to enospc.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c87f08ca
    • C
      Btrfs: put ENOSPC debugging under a mount option · 91435650
      Chris Mason 提交于
      ENOSPC in btrfs is getting to the point where the extra debugging isn't
      required.  I've put it under mount -o enospc_debug just in case someone
      is having difficult problems.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      91435650
  3. 18 1月, 2011 1 次提交
    • L
      Btrfs: forced readonly mounts on errors · acce952b
      liubo 提交于
      This patch comes from "Forced readonly mounts on errors" ideas.
      
      As we know, this is the first step in being more fault tolerant of disk
      corruptions instead of just using BUG() statements.
      
      The major content:
      - add a framework for generating errors that should result in filesystems
        going readonly.
      - keep FS state in disk super block.
      - make sure that all of resource will be freed and released at umount time.
      - make sure that fter FS is forced readonly on error, there will be no more
        disk change before FS is corrected. For this, we should stop write operation.
      
      After this patch is applied, the conversion from BUG() to such a framework can
      happen incrementally.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      acce952b
  4. 17 1月, 2011 3 次提交
    • S
      fs/btrfs: Fix build of ctree · f8b18087
      Stefan Schmidt 提交于
      Fix the build failure in some configurations:
      
           CC [M]  fs/btrfs/ctree.o
        In file included from fs/btrfs/ctree.c:21:0:
        fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
        fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
        make[2]: *** [fs/btrfs/ctree.o] Error 1
        make[1]: *** [fs/btrfs] Error 2
        make: *** [fs] Error 2
      
      caused by commit 57cc7215 ("headers: kobject.h redux")
      
      We need to include kobject.h here.
      Reported-by: NJeff Garzik <jeff@garzik.org>
      Fix-suggested-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NStefan Schmidt <stefan@datenfreihafen.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8b18087
    • M
      btrfs: fix wrong free space information of btrfs · 6d07bcec
      Miao Xie 提交于
      When we store data by raid profile in btrfs with two or more different size
      disks, df command shows there is some free space in the filesystem, but the
      user can not write any data in fact, df command shows the wrong free space
      information of btrfs.
      
       # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
       # btrfs-show
       Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
       	Total devices 2 FS bytes used 28.00KB
       	devid    1 size 5.01GB used 2.03GB path /dev/sda9
       	devid    2 size 10.00GB used 2.01GB path /dev/sda10
       # btrfs device scan /dev/sda9 /dev/sda10
       # mount /dev/sda9 /mnt
       # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
         (fill the filesystem)
       # sync
       # df -TH
       Filesystem	Type	Size	Used	Avail	Use%	Mounted on
       /dev/sda9	btrfs	17G	8.6G	5.4G	62%	/mnt
       # btrfs-show
       Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
       	Total devices 2 FS bytes used 3.99GB
       	devid    1 size 5.01GB used 5.01GB path /dev/sda9
       	devid    2 size 10.00GB used 4.99GB path /dev/sda10
      
      It is because btrfs cannot allocate chunks when one of the pairing disks has
      no space, the free space on the other disks can not be used for ever, and should
      be subtracted from the total space, but btrfs doesn't subtract this space from
      the total. It is strange to the user.
      
      This patch fixes it by calcing the free space that can be used to allocate
      chunks.
      
      Implementation:
      1. get all the devices free space, and align them by stripe length.
      2. sort the devices by the free space.
      3. check the free space of the devices,
         3.1. if it is not zero, and then check the number of the devices that has
              more free space than this device,
              if the number of the devices is beyond the min stripe number, the free
              space can be used, and add into total free space.
              if the number of the devices is below the min stripe number, we can not
              use the free space, the check ends.
         3.2. if the free space is zero, check the next devices, goto 3.1
      
      This implementation is just likely fake chunk allocation.
      
      After appling this patch, df can show correct space information:
       # df -TH
       Filesystem	Type	Size	Used	Avail	Use%	Mounted on
       /dev/sda9	btrfs	17G	8.6G	0	100%	/mnt
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      6d07bcec
    • S
      fs/btrfs: Fix build of ctree · f580eb09
      Stefan Schmidt 提交于
      CC [M]  fs/btrfs/ctree.o
      In file included from fs/btrfs/ctree.c:21:0:
      fs/btrfs/ctree.h:1003:17: error: field <91>super_kobj<92> has incomplete type
      fs/btrfs/ctree.h:1074:17: error: field <91>root_kobj<92> has incomplete type
      make[2]: *** [fs/btrfs/ctree.o] Error 1
      make[1]: *** [fs/btrfs] Error 2
      make: *** [fs] Error 2
      
      We need to include kobject.h here.
      Reported-by: NJeff Garzik <jeff@garzik.org>
      Fix-suggested-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NStefan Schmidt <stefan@datenfreihafen.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f580eb09
  5. 07 1月, 2011 1 次提交
  6. 23 12月, 2010 1 次提交
    • L
      Btrfs: Add readonly snapshots support · b83cc969
      Li Zefan 提交于
      Usage:
      
      Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call
      ioctl(BTRFS_I0CTL_SNAP_CREATE_V2).
      
      Implementation:
      
      - Set readonly bit of btrfs_root_item->flags.
      - Add readonly checks in btrfs_permission (inode_permission),
      btrfs_setattr, btrfs_set/remove_xattr and some ioctls.
      
      Changelog for v3:
      
      - Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
      - Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      b83cc969
  7. 22 12月, 2010 2 次提交
    • L
      btrfs: Add lzo compression support · a6fa6fae
      Li Zefan 提交于
      Lzo is a much faster compression algorithm than gzib, so would allow
      more users to enable transparent compression, and some users can
      choose from compression ratio and speed for different applications
      
      Usage:
      
       # mount -t btrfs -o compress[=<zlib,lzo>] dev /mnt
      or
       # mount -t btrfs -o compress-force[=<zlib,lzo>] dev /mnt
      
      "-o compress" without argument is still allowed for compatability.
      
      Compatibility:
      
      If we mount a filesystem with lzo compression, it will not be able be
      mounted in old kernels. One reason is, otherwise btrfs will directly
      dump compressed data, which sits in inline extent, to user.
      
      Performance:
      
      The test copied a linux source tarball (~400M) from an ext4 partition
      to the btrfs partition, and then extracted it.
      
      (time in second)
                 lzo        zlib        nocompress
      copy:      10.6       21.7        14.9
      extract:   70.1       94.4        66.6
      
      (data size in MB)
                 lzo        zlib        nocompress
      copy:      185.87     108.69      394.49
      extract:   193.80     132.36      381.21
      
      Changelog:
      
      v1 -> v2:
      - Select LZO_COMPRESS and LZO_DECOMPRESS in btrfs Kconfig.
      - Add incompability flag.
      - Fix error handling in compress code.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      a6fa6fae
    • L
      btrfs: Allow to add new compression algorithm · 261507a0
      Li Zefan 提交于
      Make the code aware of compression type, instead of always assuming
      zlib compression.
      
      Also make the zlib workspace function as common code for all
      compression types.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      261507a0
  8. 22 11月, 2010 1 次提交
  9. 30 10月, 2010 2 次提交
    • S
      Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed · 4260f7c7
      Sage Weil 提交于
      Add a mount option user_subvol_rm_allowed that allows users to delete a
      (potentially non-empty!) subvol when they would otherwise we allowed to do
      an rmdir(2).  We duplicate the may_delete() checks from the core VFS code
      to implement identical security checks (minus the directory size check).
      We additionally require that the user has write+exec permission on the
      subvol root inode.
      Signed-off-by: NSage Weil <sage@newdream.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4260f7c7
    • S
      Btrfs: async transaction commit · bb9c12c9
      Sage Weil 提交于
      Add support for an async transaction commit that is ordered such that any
      subsequent operations will join the following transaction, but does not
      wait until the current commit is fully on disk.  This avoids much of the
      latency associated with the btrfs_commit_transaction for callers concerned
      with serialization and not safety.
      
      The wait_for_unblock flag controls whether we wait for the 'middle' portion
      of commit_transaction to complete, which is necessary if the caller expects
      some of the modifications contained in the commit to be available (this is
      the case for subvol/snapshot creation).
      Signed-off-by: NSage Weil <sage@newdream.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bb9c12c9
  10. 29 10月, 2010 4 次提交
    • J
      Btrfs: Add a clear_cache mount option · 88c2ba3b
      Josef Bacik 提交于
      If something goes wrong with the free space cache we need a way to make sure
      it's not loaded on mount and that it's cleared for everybody.  When you pass the
      clear_cache option it will make it so all block groups are setup to be cleared,
      which keeps them from being loaded and then they will be truncated when the
      transaction is committed.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      88c2ba3b
    • J
      Btrfs: add support for mixed data+metadata block groups · 67377734
      Josef Bacik 提交于
      There are just a few things that need to be fixed in the kernel to support mixed
      data+metadata block groups.  Mostly we just need to make sure that if we are
      using mixed block groups that we continue to allocate mixed block groups as we
      need them.  Also we need to make sure __find_space_info will find our space info
      if we search for DATA or METADATA only.  Tested this with xfstests and it works
      nicely.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      67377734
    • J
      Btrfs: write out free space cache · 0cb59c99
      Josef Bacik 提交于
      This is a simple bit, just dump the free space cache out to our preallocated
      inode when we're writing out dirty block groups.  There are a bunch of changes
      in inode.c in order to account for special cases.  Mostly when we're doing the
      writeout we're holding trans_mutex, so we need to use the nolock transacation
      functions.  Also we can't do asynchronous completions since the async thread
      could be blocked on already completed IO waiting for the transaction lock.  This
      has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
      tests.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0cb59c99
    • J
      Btrfs: create special free space cache inode · 0af3d00b
      Josef Bacik 提交于
      In order to save free space cache, we need an inode to hold the data, and we
      need a special item to point at the right inode for the right block group.  So
      first, create a special item that will point to the right inode, and the number
      of extent entries we will have and the number of bitmaps we will have.  We
      truncate and pre-allocate space everytime to make sure it's uptodate.
      
      This feature will be turned on as soon as you mount with -o space_cache, however
      it is safe to boot into old kernels, they will just generate the cache the old
      fashion way.  When you boot back into a newer kernel we will notice that we
      modified and not the cache and automatically discard the cache.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0af3d00b
  11. 23 10月, 2010 3 次提交
    • J
      Btrfs: rework how we reserve metadata bytes · 8bb8ab2e
      Josef Bacik 提交于
      With multi-threaded writes we were getting ENOSPC early because somebody would
      come in, start flushing delalloc because they couldn't make their reservation,
      and in the meantime other threads would come in and use the space that was
      getting freed up, so when the original thread went to check to see if they had
      space they didn't and they'd return ENOSPC.  So instead if we have some free
      space but not enough for our reservation, take the reservation and then start
      doing the flushing.  The only time we don't take reservations is when we've
      already overcommitted our space, that way we don't have people who come late to
      the party way overcommitting ourselves.  This also moves all of the retrying and
      flushing code into reserve_metdata_bytes so it's all uniform.  This keeps my
      fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
      fill up the disk.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      8bb8ab2e
    • J
      Btrfs: re-work delalloc flushing · 0019f10d
      Josef Bacik 提交于
      Currently we try and flush delalloc, but we only do that in a sort of weak way,
      which works fine in most cases but if we're under heavy pressure we need to be
      able to wait for flushing to happen.  Also instead of checking the bytes
      reserved in the block_rsv, check the space info since it is more accurate.  The
      sync option will be used in a future patch.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0019f10d
    • J
      Btrfs: fix df regression · 89a55897
      Josef Bacik 提交于
      The new ENOSPC stuff breaks out the raid types which breaks the way we were
      reporting df to the system.  This fixes it back so that Available is the total
      space available to data and used is the actual bytes used by the filesystem.
      This means that Available is Total - data used - all of the metadata space.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      89a55897
  12. 10 8月, 2010 2 次提交
  13. 28 5月, 2010 1 次提交
  14. 25 5月, 2010 11 次提交
  15. 31 3月, 2010 1 次提交
  16. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  17. 15 3月, 2010 3 次提交
    • A
      btrfs: use memparse · 91748467
      Akinobu Mita 提交于
      Use memparse() instead of its own private implementation.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: linux-btrfs@vger.kernel.org
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      91748467
    • J
      Btrfs: cache the extent state everywhere we possibly can V2 · 2ac55d41
      Josef Bacik 提交于
      This patch just goes through and fixes everybody that does
      
      lock_extent()
      blah
      unlock_extent()
      
      to use
      
      lock_extent_bits()
      blah
      unlock_extent_cached()
      
      and pass around a extent_state so we only have to do the searches once per
      function.  This gives me about a 3 mb/s boots on my random write test.  I have
      not converted some things, like the relocation and ioctl's, since they aren't
      heavily used and the relocation stuff is in the middle of being re-written.  I
      also changed the clear_extent_bit() to only unset the cached state if we are
      clearing EXTENT_LOCKED and related stuff, so we can do things like this
      
      lock_extent_bits()
      clear delalloc bits
      unlock_extent_cached()
      
      without losing our cached state.  I tested this thoroughly and turned on
      LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
      fine.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2ac55d41
    • C
      Btrfs: add new defrag-range ioctl. · 1e701a32
      Chris Mason 提交于
      The btrfs defrag ioctl was limited to doing the entire file.  This
      commit adds a new interface that can defrag a specific range inside
      the file.
      
      It can also force compression on the file, allowing you to selectively
      compress individual files after they were created, even when mount -o
      compress isn't turned on.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1e701a32