1. 17 Jan, 2011 9 commits
    • btrfs: fix wrong free space information of btrfs · 6d07bcec
      Committed by Miao Xie
      When data is stored with a RAID profile on a btrfs filesystem built from two
      or more disks of different sizes, df reports free space that the user cannot
      actually write to. In other words, df shows wrong free space information for
      btrfs.
      
       # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
       # btrfs-show
       Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
       	Total devices 2 FS bytes used 28.00KB
       	devid    1 size 5.01GB used 2.03GB path /dev/sda9
       	devid    2 size 10.00GB used 2.01GB path /dev/sda10
       # btrfs device scan /dev/sda9 /dev/sda10
       # mount /dev/sda9 /mnt
       # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=9999999999
         (fill the filesystem)
       # sync
       # df -TH
       Filesystem	Type	Size	Used	Avail	Use%	Mounted on
       /dev/sda9	btrfs	17G	8.6G	5.4G	62%	/mnt
       # btrfs-show
       Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
       	Total devices 2 FS bytes used 3.99GB
       	devid    1 size 5.01GB used 5.01GB path /dev/sda9
       	devid    2 size 10.00GB used 4.99GB path /dev/sda10
      
      This happens because btrfs cannot allocate chunks once one of the paired disks
      has no space left. The free space remaining on the other disks can never be
      used and should be subtracted from the total space, but btrfs doesn't subtract
      it. The result is confusing to the user.
      
      This patch fixes it by calculating the free space that can actually be used to
      allocate chunks.
      
      Implementation:
      1. Get the free space of every device and align it to the stripe length.
      2. Sort the devices by free space.
      3. Check the free space of each device in turn:
         3.1. If it is not zero, count the devices that have more free space than
              this device.
              If that count reaches the min stripe number, the free space can be
              used and is added to the total free space.
              If that count is below the min stripe number, the free space cannot
              be used and the check ends.
         3.2. If the free space is zero, move on to the next device and go to 3.1.
      
      This implementation works much like a fake chunk allocation.
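
      The steps above can be sketched in a few lines of Python. This is only an
      illustration, not the kernel code: the function name, the 64 KiB stripe
      length, and the choice to count raw space (size times min_stripes) are
      assumptions, and a device is counted together with the larger devices when
      comparing against min_stripes.

```python
def usable_free_space(free_bytes, min_stripes, stripe_len=64 * 1024):
    """Estimate the raw free space that chunk allocation could still consume.

    free_bytes  -- unallocated bytes on each device (hypothetical input)
    min_stripes -- devices a chunk needs at minimum (2 for a raid1-like profile)
    """
    # Step 1: align each device's free space down to the stripe length.
    avail = [(f // stripe_len) * stripe_len for f in free_bytes]
    # Step 2: sort by free space, largest first.
    avail.sort(reverse=True)

    total = 0
    n = len(avail)
    # Step 3: walk from the smallest device upward, pretending to allocate.
    while n >= min_stripes:
        size = avail[n - 1]
        if size > 0:
            # The smallest of the remaining n devices limits this fake chunk.
            total += size * min_stripes
            for j in range(n - min_stripes, n):
                avail[j] -= size
        # A device with zero free space is skipped; look at the next one.
        n -= 1
    return total


GiB = 1024 ** 3
print(usable_free_space([5 * GiB, 10 * GiB], min_stripes=2) // GiB)
print(usable_free_space([0, 5 * GiB], min_stripes=2) // GiB)
```

      In the second call one disk of the pair is full, so the 5 GiB left on the
      other disk is not counted, which matches the corrected df output.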
      
      After applying this patch, df shows the correct space information:
       # df -TH
       Filesystem	Type	Size	Used	Avail	Use%	Mounted on
       /dev/sda9	btrfs	17G	8.6G	0	100%	/mnt
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: make the chunk allocator utilize the devices better · b2117a39
      Committed by Miao Xie
      This patch changes how the chunk allocator behaves when it cannot find enough
      free extents of the default size.
      
      Implementation:
      1. Look up a suitable free extent on each device and keep the search result.
         If no suitable free extent is found, keep the max free extent instead.
      2. If we get enough suitable free extents of the default size, chunk
         allocation succeeds.
      3. If we cannot get enough free extents, but the number of extents of the
         default size is >= min_stripes, we just change the mapping information
         (reduce the number of stripes in the extent map), and chunk allocation
         succeeds.
      4. If the number of extents of the default size is < min_stripes, sort the
         devices by the size of their max free extent, in descending order.
      5. Use the size of the max free extent on the (num_stripes - 1)th device as
         the stripe size to allocate the device space.
      
      This way, the chunk allocator can allocate chunks that are as large as
      possible when device space is short, and makes full use of the devices.
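
      The decision in steps 2 to 5 can be sketched as follows. This is a simplified
      model and not the kernel function: plan_chunk and its parameters are
      hypothetical names, and "(num_stripes - 1)th device" is read as a zero-based
      index into the sorted list.

```python
import errno


def plan_chunk(max_free_extents, default_size, num_stripes, min_stripes):
    """Return (stripe_count, stripe_size) for a new chunk, or raise ENOSPC.

    max_free_extents -- largest free extent found on each device (step 1)
    """
    suitable = [e for e in max_free_extents if e >= default_size]

    # Step 2: enough extents of the default size, allocate as requested.
    if len(suitable) >= num_stripes:
        return num_stripes, default_size

    # Step 3: fewer than requested, but still at least min_stripes.
    # Keep the default size and reduce the number of stripes.
    if len(suitable) >= min_stripes:
        return len(suitable), default_size

    # Step 4: sort by max free extent, largest first.
    ranked = sorted(max_free_extents, reverse=True)

    # Step 5: the extent on device index (num_stripes - 1) bounds the stripe size.
    if len(ranked) < num_stripes or ranked[num_stripes - 1] == 0:
        raise OSError(errno.ENOSPC, "not enough device space for a chunk")
    return num_stripes, ranked[num_stripes - 1]


print(plan_chunk([100, 30, 20], default_size=64, num_stripes=2, min_stripes=2))
```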
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: restructure find_free_dev_extent() · 7bfc837d
      Committed by Miao Xie
      - make it return the start position and length of the max free space when it
        cannot find a suitable free space.
      - make it more readable
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: fix wrong calculation of stripe size · 1974a3b4
      Committed by Miao Xie
      There are two small problems:
      - First, when we check whether the chunk size is greater than the max chunk
        size, we should take mirrors into account, but the original code didn't.
      - Second, btrfs shouldn't use the size of the residual free space as the
        length of a dup chunk when doing chunk allocation. The device space that a
        dup chunk needs is twice the chunk size, so if we use the size of the
        residual free space as the length of a dup chunk, we cannot get enough
        free space. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: try to reclaim some space when chunk allocation fails · d52a5b5f
      Committed by Miao Xie
      We cannot write data into files when there is very little free space left in
      the filesystem.
      
      Steps to reproduce:
       # mkfs.btrfs /dev/sda1
       # mount /dev/sda1 /mnt
       # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
       # dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999
         (fill the filesystem)
       # umount /mnt
       # mount /dev/sda1 /mnt
       # rm -f /mnt/tmpfile0
       # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
         (fails with ENOSPC)
      
      But if we do the last step again, the write succeeds. The cause of the
      problem is that btrfs didn't try to commit the current transaction and
      reclaim some space when chunk allocation failed.
      
      This patch fixes it by committing the current transaction to reclaim some
      space when chunk allocation fails.
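
      The idea is a single retry after forcing a commit. A minimal sketch, assuming
      two hypothetical callables: try_alloc, which raises ENOSPC when no chunk can
      be allocated, and commit_transaction, which frees space that is pinned by the
      running transaction. Neither name comes from the kernel source.

```python
import errno


def allocate_with_reclaim(try_alloc, commit_transaction):
    """Try to allocate; on ENOSPC commit once to reclaim space, then retry."""
    try:
        return try_alloc()
    except OSError as exc:
        if exc.errno != errno.ENOSPC:
            raise
    # Space freed by earlier deletes only becomes reusable after a commit.
    commit_transaction()
    # Retry exactly once; a second ENOSPC is reported to the caller.
    return try_alloc()
```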
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: fix wrong data space statistics · 299a08b1
      Committed by Miao Xie
      Josef has implemented mixed data/metadata chunks, so we must count the space
      of those chunks just as we do for data chunks.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • fs/btrfs: Fix build of ctree · f580eb09
      Committed by Stefan Schmidt
      CC [M]  fs/btrfs/ctree.o
      In file included from fs/btrfs/ctree.c:21:0:
      fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
      fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
      make[2]: *** [fs/btrfs/ctree.o] Error 1
      make[1]: *** [fs/btrfs] Error 2
      make: *** [fs] Error 2
      
      We need to include kobject.h here.
      Reported-by: Jeff Garzik <jeff@garzik.org>
      Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  2. 05 Jan, 2011 1 commit
  3. 23 Dec, 2010 3 commits
    • Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls · 0caa102d
      Committed by Li Zefan
      This allows us to set a snapshot or a subvolume readonly or writable
      on the fly.
      
      Usage:
      
      Set BTRFS_SUBVOL_RDONLY in btrfs_ioctl_vol_arg_v2->flags, and then
      call ioctl(BTRFS_IOC_SUBVOL_SETFLAGS);
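
      A hedged userspace sketch of the call from Python. It follows the v3
      changelog entry, where the ioctl argument is a bare __u64 of flags. The
      request numbers (magic 0x94, commands 25 and 26), the flag value 1 << 1, and
      the common Linux _IOC bit layout are assumptions taken from memory of the
      kernel headers; check them against linux/btrfs.h on the target system. The
      helper name set_subvol_readonly is hypothetical.

```python
import fcntl
import os
import struct

BTRFS_IOCTL_MAGIC = 0x94          # assumed, see linux/btrfs.h
BTRFS_SUBVOL_RDONLY = 1 << 1      # assumed, see linux/btrfs.h
_IOC_WRITE, _IOC_READ = 1, 2      # generic Linux encoding (x86, arm, ...)


def _ioc(direction, nr, size):
    """Build an ioctl request number using the common Linux _IOC layout."""
    return (direction << 30) | (size << 16) | (BTRFS_IOCTL_MAGIC << 8) | nr


BTRFS_IOC_SUBVOL_GETFLAGS = _ioc(_IOC_READ, 25, 8)
BTRFS_IOC_SUBVOL_SETFLAGS = _ioc(_IOC_WRITE, 26, 8)


def set_subvol_readonly(path, readonly=True):
    """Toggle the read-only flag on the subvolume whose root is `path`."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        buf = bytearray(8)
        # Read the current flags into buf (filled in place by the kernel).
        fcntl.ioctl(fd, BTRFS_IOC_SUBVOL_GETFLAGS, buf)
        (flags,) = struct.unpack("=Q", buf)
        if readonly:
            flags |= BTRFS_SUBVOL_RDONLY
        else:
            flags &= ~BTRFS_SUBVOL_RDONLY
        fcntl.ioctl(fd, BTRFS_IOC_SUBVOL_SETFLAGS, struct.pack("=Q", flags))
    finally:
        os.close(fd)
```

      Calling set_subvol_readonly("/mnt/snap") would require sufficient privileges
      and a path that is the root of a btrfs subvolume.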
      
      Changelog for v3:
      
      - Change to pass __u64 as ioctl parameter.
      
      Changelog for v2:
      
      - Add _GETFLAGS ioctl.
      - Check if the passed fd is the root of a subvolume.
      - Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • Btrfs: Add readonly snapshots support · b83cc969
      Committed by Li Zefan
      Usage:
      
      Set BTRFS_SUBVOL_RDONLY in btrfs_ioctl_vol_arg_v2->flags, and call
      ioctl(BTRFS_IOC_SNAP_CREATE_V2).
      
      Implementation:
      
      - Set readonly bit of btrfs_root_item->flags.
      - Add readonly checks in btrfs_permission (inode_permission),
      btrfs_setattr, btrfs_set/remove_xattr and some ioctls.
      
      Changelog for v3:
      
      - Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
      - Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • Btrfs: Refactor btrfs_ioctl_snap_create() · fa0d2b9b
      Committed by Li Zefan
      Split it into two functions for two different ioctls, since they
      share no common code.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
  4. 22 Dec, 2010 6 commits
    • btrfs: Extract duplicate decompress code · 3a39c18d
      Committed by Li Zefan
      Add a common function to copy decompressed data from working buffer
      to bio pages.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • btrfs: Allow to specify compress method when defrag · 1a419d85
      Committed by Li Zefan
      Update defrag ioctl, so one can choose lzo or zlib when turning
      on compression in defrag operation.
      
      Changelog:
      
      v1 -> v2
      - Add incompatibility flag.
      - Fix to check invalid compress type.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • btrfs: Add lzo compression support · a6fa6fae
      Committed by Li Zefan
      Lzo is a much faster compression algorithm than gzip, so it allows more
      users to enable transparent compression, and lets users trade off
      compression ratio against speed for different applications.
      
      Usage:
      
       # mount -t btrfs -o compress[=<zlib,lzo>] dev /mnt
      or
       # mount -t btrfs -o compress-force[=<zlib,lzo>] dev /mnt
      
      "-o compress" without an argument is still allowed for compatibility.
      
      Compatibility:
      
      If we mount a filesystem with lzo compression, it can no longer be
      mounted by old kernels. One reason is that an old kernel would otherwise
      hand compressed data that sits in inline extents directly to the user.
      
      Performance:
      
      The test copied a linux source tarball (~400M) from an ext4 partition
      to the btrfs partition, and then extracted it.
      
      (time in seconds)
                 lzo        zlib        nocompress
      copy:      10.6       21.7        14.9
      extract:   70.1       94.4        66.6
      
      (data size in MB)
                 lzo        zlib        nocompress
      copy:      185.87     108.69      394.49
      extract:   193.80     132.36      381.21
      
      Changelog:
      
      v1 -> v2:
      - Select LZO_COMPRESS and LZO_DECOMPRESS in btrfs Kconfig.
      - Add incompatibility flag.
      - Fix error handling in compress code.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • btrfs: Allow to add new compression algorithm · 261507a0
      Committed by Li Zefan
      Make the code aware of compression type, instead of always assuming
      zlib compression.
      
      Also make the zlib workspace function as common code for all
      compression types.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • btrfs: Fix error handling in zlib · 4b72029d
      Committed by Li Zefan
      Return failure if alloc_page() fails to allocate memory,
      and the upper code will just give up compression.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • btrfs: Fix bugs in zlib workspace · 8844355d
      Committed by Li Zefan
      - Fix a race that can result in alloc_workspace > cpus.
      - Fix to check num_workspace after wakeup.
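
      Both fixes follow a common pattern for a bounded pool: reserve the slot while
      holding the lock, and re-check the state after every wakeup. The sketch below
      shows that pattern with Python threads. It is not the kernel code;
      WorkspacePool, its fields, and the 4096-byte buffer are hypothetical.

```python
import threading


class WorkspacePool:
    """Hand out at most `limit` workspaces; extra callers wait for a free one."""

    def __init__(self, limit):
        self.limit = limit
        self.idle = []
        self.allocated = 0
        self.cond = threading.Condition()

    def get(self):
        with self.cond:
            # Loop so the state is checked again after each wakeup.
            while True:
                if self.idle:
                    return self.idle.pop()
                if self.allocated < self.limit:
                    # Count the new workspace before releasing the lock, so
                    # two callers cannot both pass the limit check.
                    self.allocated += 1
                    return bytearray(4096)
                self.cond.wait()

    def put(self, workspace):
        with self.cond:
            self.idle.append(workspace)
            self.cond.notify()
```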
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
  5. 14 Dec, 2010 3 commits
    • Btrfs: prevent RAID level downgrades when space is low · 83a50de9
      Committed by Chris Mason
      The extent allocator has code that allows us to fill
      allocations from any available block group, even if it doesn't
      match the raid level we've requested.
      
      This was put in because adding a new drive to a filesystem
      made with the default mkfs options actually upgrades the metadata from
      single spindle dup to full RAID1.
      
      But, the code also allows us to allocate from a raid0 chunk when we
      really want a raid1 or raid10 chunk.  This can cause big trouble because
      mkfs creates a small (4MB) raid0 chunk for data and metadata which then
      goes unused for raid1/raid10 installs.
      
      The allocator will happily wander in and allocate from that chunk when
      things get tight, which is not correct.
      
      The fix here is to make sure that we provide duplication when the
      caller has asked for it.  It does allow the dups to be any raid level,
      which preserves the dup->raid1 upgrade abilities.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: account for missing devices in RAID allocation profiles · cd02dca5
      Committed by Chris Mason
      When we mount in RAID degraded mode without adding a new device to
      replace the failed one, we can end up using the wrong RAID flags for
      allocations.
      
      This results in strange combinations of block groups (raid1 in a raid10
      filesystem) and corruptions when we try to allocate blocks from single
      spindle chunks on drives that are actually missing.
      
      The first device has two small 4MB chunks in it that mkfs creates and
      these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
      the allocator will fall back to these because the mask of desired raid groups
      isn't correct.
      
      The fix here is to count the missing devices as we build up the list
      of devices in the system.  This count is used when picking the
      raid level to make sure we continue using the same levels that were
      in place before we lost a drive.
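
      A small sketch of why the count matters when the allowed RAID levels are
      derived from the number of devices. The thresholds (one device rules out
      raid0/raid1, fewer than four rules out raid10) and the flag bit values are
      assumptions made for illustration, and reduce_alloc_profile is a
      hypothetical name.

```python
# Illustrative block group flag bits (assumed values).
RAID0, RAID1, DUP, RAID10 = 1 << 3, 1 << 4, 1 << 5, 1 << 6


def reduce_alloc_profile(flags, rw_devices, missing_devices):
    """Drop RAID levels that the filesystem's device count cannot support."""
    # Count missing devices too, so a degraded mount keeps its RAID level.
    num_devices = rw_devices + missing_devices
    if num_devices == 1:
        flags &= ~(RAID0 | RAID1)
    if num_devices < 4:
        flags &= ~RAID10
    return flags


# Two-disk raid1 mounted degraded with one disk missing:
print(reduce_alloc_profile(RAID1, rw_devices=1, missing_devices=1) == RAID1)
# Ignoring the missing disk would strip the raid1 bit:
print(reduce_alloc_profile(RAID1, rw_devices=1, missing_devices=0) == 0)
```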
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: EIO when we fail to read tree roots · 68433b73
      Committed by Chris Mason
      If we just get a plain IO error when we read tree roots, the code
      wasn't properly sending that error up the chain.  This allowed mounts to
      continue when they should have failed, and allowed operations
      on partially set up root structs.  The end result was usually oopsen
      on spinlocks that hadn't been spun up correctly.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 11 Dec, 2010 7 commits
  7. 10 Dec, 2010 4 commits
  8. 29 Nov, 2010 2 commits
  9. 28 Nov, 2010 5 commits
    • Btrfs: setup blank root and fs_info for mount time · 450ba0ea
      Committed by Josef Bacik
      There is a problem with how we use sget: it searches through the list of
      supers attached to the fs_type looking for a super with the same fs_devices as
      the one we're trying to mount.  This depends on sb->s_fs_info being filled, but
      we don't fill that in until we get to btrfs_fill_super, so we could hit supers
      on the fs_type super list that have a null s_fs_info.  To fix that, we set up a
      blank root with a blank fs_info to hold fs_devices up front, so that our test
      works correctly.  We can then set s_fs_info in btrfs_set_super, and open_ctree
      simply uses our pre-allocated root and fs_info when setting everything up.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix fiemap · 975f84fe
      Committed by Josef Bacik
      There are currently two big problems with FIEMAP:
      
      1) We return extents for holes.  This isn't supposed to happen.  We should
      simply not return extents for holes, and userspace then interprets the lack of
      an extent as a hole.
      
      2) We sometimes don't set FIEMAP_EXTENT_LAST properly.  This is because we wait
      to see an EXTENT_FLAG_VACANCY flag on the em, but this won't happen if, say, we
      ask fiemap to map up to the last extent in a file and there is nothing but
      holes up to the i_size.  To fix this we need to look up the last extent in the
      file and save its logical offset, so if we happen to try and map that extent we
      can be sure to set FIEMAP_EXTENT_LAST.
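
      Both rules can be modelled in a few lines. This sketch is an illustration
      only: the extent tuples and the fiemap function are hypothetical, and only
      the flag value FIEMAP_EXTENT_LAST = 0x1 is taken from the fiemap interface.

```python
FIEMAP_EXTENT_LAST = 0x1


def fiemap(extents, start, length):
    """Report real extents overlapping [start, start + length).

    extents -- list of (logical_offset, length, is_hole), sorted by offset
    """
    # Find the file's last real extent once and remember its logical offset.
    real = [e for e in extents if not e[2]]
    last_offset = real[-1][0] if real else None

    end = start + length
    result = []
    for offset, size, is_hole in extents:
        # Rule 1: never report holes; userspace infers them from the gaps.
        if is_hole:
            continue
        if offset + size <= start or offset >= end:
            continue
        # Rule 2: flag the last real extent even if only holes follow it.
        flags = FIEMAP_EXTENT_LAST if offset == last_offset else 0
        result.append((offset, size, flags))
    return result


layout = [(0, 4096, False), (4096, 8192, True),
          (12288, 4096, False), (16384, 4096, True)]
print(fiemap(layout, 0, 20480))
```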
      
      With this patch we now pass xfstest 225, which we never have before.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs - fix race between btrfs_get_sb() and umount · 619c8c76
      Committed by Ian Kent
      When mounting a btrfs file system btrfs_test_super() may attempt to
      use sb->s_fs_info, the btrfs root, of a super block that is going away
      and that has had the btrfs root set to NULL in its ->put_super(). But
      if the super block is going away it cannot be an existing super block
      so we can return false in this case.
      Signed-off-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: update inode ctime when using links · bc1cbf1f
      Committed by Josef Bacik
      Currently we fail xfstest 236 because we're not updating the inode ctime on
      link.  This is a simple fix, and makes it so we pass 236 now.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: make sure new inode size is ok in fallocate · 0ed42a63
      Committed by Josef Bacik
      We have been failing xfstest 228 forever, because we don't check to make sure
      the new inode size is acceptable as far as RLIMIT is concerned.  Just check to
      make sure it's ok to create an inode with this new size and error out if not.
      With this patch we now pass 228.
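
      The check amounts to comparing the requested size with the process file size
      limit before allocating anything. A userspace sketch of that comparison,
      assuming the limit in question is RLIMIT_FSIZE and that EFBIG is the error
      reported; check_new_size is a hypothetical helper, not the kernel function.

```python
import errno
import resource


def check_new_size(new_size, limit=None):
    """Raise EFBIG if new_size exceeds the file size limit."""
    if limit is None:
        # Soft limit of the calling process.
        limit, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    if limit != resource.RLIM_INFINITY and new_size > limit:
        raise OSError(errno.EFBIG, "new file size exceeds RLIMIT_FSIZE")
```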
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>