1. 20 Jan, 2013 2 commits
    • Btrfs: fix "mutually exclusive op is running" error code · 2c0c9da0
      Ilya Dryomov committed
      The error code that is returned in response to starting a mutually
      exclusive operation when there is one already running got silently
      changed from EINVAL to EINPROGRESS by 5ac00add.  Returning EINPROGRESS
      to, say, add_dev, when rm_dev is running is misleading.  Furthermore,
      the operation itself may want to use EINPROGRESS for other purposes.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
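The non-blocking check described in this message can be sketched in userspace C as follows. This is a minimal illustration, not the kernel code: `struct fs_state`, `exclusive_op_start()` and `exclusive_op_finish()` are hypothetical names, and it assumes the fix restores EINVAL as the "already running" error, as the message implies.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for per-filesystem state; not the kernel's structure. */
struct fs_state {
	atomic_int exclusive_op_running;
};

/*
 * Try to claim the single exclusive-operation slot without blocking.
 * Returns 0 on success, or -EINVAL if another operation already holds it.
 */
static int exclusive_op_start(struct fs_state *fs)
{
	assert(fs != NULL);
	/* atomic_exchange returns the previous value; nonzero means busy. */
	if (atomic_exchange(&fs->exclusive_op_running, 1))
		return -EINVAL;
	return 0;
}

/* Release the slot so the next add_dev, rm_dev, resize, ... can proceed. */
static void exclusive_op_finish(struct fs_state *fs)
{
	assert(fs != NULL);
	atomic_store(&fs->exclusive_op_running, 0);
}
```

Keeping EINPROGRESS out of this path leaves that code free for an operation to report its own state.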
    • Btrfs: bring back balance pause/resume logic · ed0fb78f
      Ilya Dryomov committed
      Balance pause/resume logic got broken by 5ac00add (went into 3.8-rc1
      as part of dev-replace merge).  Offending commit took a stab at making
      mutually exclusive volume operations (add_dev, rm_dev, resize, balance,
      replace_dev) not block behind volume_mutex if another such operation is
      in progress and instead return an error right away.  Balancing front-end
      relied on the blocking behaviour, so the fix is ugly, but short of a
      complete rework, it's the best we can do.
      Reported-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  2. 15 Jan, 2013 4 commits
  3. 18 Dec, 2012 2 commits
    • Btrfs: fix a bug of per-file nocow · 213490b3
      Liu Bo committed
      Users report a bug, the reproducer is:
      $ mkfs.btrfs /dev/loop0
      $ mount /dev/loop0 /mnt/btrfs/
      $ mkdir /mnt/btrfs/dir
      $ chattr +C /mnt/btrfs/dir/
      $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
      $ lsattr /mnt/btrfs/dir/foo
      ---------------C- /mnt/btrfs/dir/foo
      $ filefrag /mnt/btrfs/dir/foo
      /mnt/btrfs/dir/foo: 1 extent found    ---> an extent
      $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
      $ filefrag /mnt/btrfs/dir/foo
      /mnt/btrfs/dir/foo: 3 extents found   ---> with nocow, btrfs breaks the extent into three parts
      
      A newly created file should not only inherit the NODATACOW flag but also
      honor the NODATASUM flag, because we must do COW on a file extent that has a checksum.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix hash overflow handling · 9c52057c
      Chris Mason committed
      The handling for directory crc hash overflows was fairly obscure:
      split_leaf returns EOVERFLOW when we try to extend the item and that is
      supposed to bubble up to userland.  For a while it did so, but along the
      way we added better handling of errors and forced the FS readonly if we
      hit IO errors during the directory insertion.
      
      Along the way, we started testing only for EEXIST and the EOVERFLOW case
      was dropped.  The end result is that we may force the FS readonly if we
      catch a directory hash bucket overflow.
      
      This fixes a few problem spots.  First I add tests for EOVERFLOW in the
      places where we can safely just return the error up the chain.
      
      btrfs_rename is harder though, because it tries to insert the new
      directory item only after it has already unlinked anything the rename
      was going to overwrite.  Rather than adding very complex logic, I added
      a helper to test for the hash overflow case early while it is still safe
      to bail out.
      
      Snapshot and subvolume creation had a similar problem, so they are using
      the new helper now too.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Reported-by: Pascal Junod <pascal@junod.info>
  4. 17 Dec, 2012 8 commits
  5. 13 Dec, 2012 5 commits
  6. 26 Oct, 2012 2 commits
  7. 12 Oct, 2012 2 commits
  8. 09 Oct, 2012 2 commits
    • Btrfs: make filesystem read-only when submitting barrier fails · 5af3e8cc
      Stefan Behrens committed
      So far the return code of barrier_all_devices() is ignored, which
      means that errors are ignored. The result can be a corrupt
      filesystem which is not consistent.
      This commit adds code to evaluate the return code of
      barrier_all_devices(). The normal btrfs_error() mechanism is used to
      switch the filesystem into read-only mode when errors are detected.
      
      In order to decide whether barrier_all_devices() should return
      error or success, the number of disks that are allowed to fail the
      barrier submission is calculated. This calculation accounts for the
      worst RAID level of metadata, system and data. If single, dup or
      RAID0 is in use, a single disk error is already considered to be
      fatal. Otherwise a single disk error is tolerated.
      
      The calculation of the number of disks that are tolerated to fail
      the barrier operation is performed when the filesystem gets mounted,
      when a balance operation is started and finished, and when devices
      are added or removed.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
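The tolerance rule in this message can be sketched as below. It is an illustration under stated assumptions, not the kernel implementation: the `enum raid_profile` values and both function names are hypothetical, and the only rule encoded is the one given above (single, dup and RAID0 tolerate no failed device, every other profile tolerates one, and the worst profile among metadata, system and data wins).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical profile names; the kernel encodes profiles as flag bits. */
enum raid_profile {
	PROFILE_SINGLE,
	PROFILE_DUP,
	PROFILE_RAID0,
	PROFILE_RAID1,
	PROFILE_RAID10,
};

/* Failed devices one profile can survive, per the rule in the message. */
static int profile_tolerated_failures(enum raid_profile p)
{
	switch (p) {
	case PROFILE_SINGLE:
	case PROFILE_DUP:
	case PROFILE_RAID0:
		return 0;
	default:
		return 1;
	}
}

/* Worst case across the profiles in use (metadata, system, data). */
static int tolerated_barrier_failures(const enum raid_profile *in_use, size_t n)
{
	assert(in_use != NULL && n > 0);
	int min = profile_tolerated_failures(in_use[0]);
	for (size_t i = 1; i < n; i++) {
		int t = profile_tolerated_failures(in_use[i]);
		if (t < min)
			min = t;
	}
	return min;
}

/* 0 if the barrier counts as successful, -EIO if too many devices failed. */
static int barrier_result(int failed_devices, int tolerated)
{
	return failed_devices > tolerated ? -EIO : 0;
}
```

With RAID1 metadata and system but RAID0 data, the tolerated count is 0, so a single failed barrier submission is already fatal.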
    • Btrfs: fix off-by-one in file clone · aa42ffd9
      Liu Bo committed
      Btrfs uses inclusive range end for lock_extent(), unlock_extent() and
      related functions, so we made off-by-one errors in file clone.
      
      This fixes it and also fixes some style problems.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
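The off-by-one follows directly from the inclusive-end convention. The helpers below are hypothetical and only illustrate the arithmetic; they are not the lock_extent() API itself.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive end of a range of `len` bytes starting at `start`. */
static uint64_t inclusive_end(uint64_t start, uint64_t len)
{
	assert(len > 0);
	return start + len - 1;
}

/* Number of bytes covered by the inclusive range [start, end]. */
static uint64_t bytes_covered(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return end - start + 1;
}
```

Passing `start + len` as the end to a function that expects an inclusive end covers one byte too many, which is the class of error this commit fixes.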
  9. 04 Oct, 2012 1 commit
    • btrfs: allow setting NOCOW for a zero sized file via ioctl · 7e97b8da
      David Sterba committed
      Hi,
      
      the patch is simple, but it has user visible impact and I'm not quite sure how
      to resolve it.
      
      In short, $subj says it, chattr -C supports it and we want to use it.
      
      The conditions that actually allow changing the NOCOW flag are clear. What if
      I try to set the flag on a file that is not empty? Options:
      
      1) whole ioctl will fail, EINVAL
      2.1) ioctl will succeed, the NOCOW flag will be silently removed, but the file
           will stay COW-ed and checksummed
      2.2) ioctl will succeed, flag will not be removed and a syslog message will
           warn that the COW flag has not been changed
      2.2.1) ditto, no syslog message
      
      Man page of chattr states that
      
       "If it is set on a file which already has data blocks, it is undefined when
       the blocks assigned to the file will be fully stable."
      
      Yes, it's undefined and with current implementation it'll never happen. So from
      this end, the user cannot expect anything. I'm trying to find a reasonable
      behaviour, so that a command like 'chattr -R -aijS +C' to tweak a broad set of
      flags in a deep directory does not fail unnecessarily and does not pollute the
      log.
      
      My personal preference is 2.2.1, but my dev's opinion is skewed, not counting
      the fact that I know the code and otherwise would look there before consulting
      the documentation.
      
      The patch implements 2.2.1.
      
      david
      
      -------------8<-------------------
      From: David Sterba <dsterba@suse.cz>
      
      It's safe to turn off checksums for a zero sized file.
      
      http://thread.gmane.org/gmane.comp.file-systems.btrfs/18030
      
      "We cannot switch on NODATASUM for a file that already has extents that
      are checksummed. The invariant here is that either all the extents or
      none are checksummed.
      
      Theoretically it's possible to add/remove all checksums from a given
      file, but it's a potentially longtime operation, the file has to be in
      some intermediate state where the checksums partially exist but have to
      be ignored (for the csum->nocsum) until the file is fully converted,
      this brings more special cases to extent handling, it has to survive
      power failure and remain consistent, and probably needs to be restarted
      after next mount."
      Signed-off-by: David Sterba <dsterba@suse.cz>
  10. 02 Oct, 2012 8 commits
    • Btrfs: use larger limit for translation of logical to inode · 425d17a2
      Liu Bo committed
      This is the kernel-side change.
      
      Translation of logical to inode used to have an upper limit of 4k on
      the inode container's size, but that limit is not large enough for data
      with a great many refs, so when resolving a logical address
      we can end up with
      "ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493"
      
      This raises the upper limit to 64k and uses vmalloc instead of
      kmalloc so that the memory is easier to get.
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
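The numbers in the quoted ioctl output can be checked with a little arithmetic. The sketch below assumes the result buffer begins with a 16-byte header of four 32-bit counters followed by 64-bit values; that layout and the names `struct data_container`, `elems_that_fit()` and `bytes_for_elems()` are assumptions for illustration, although the layout does reproduce cnt=510 for a 4k buffer and bytes_missing=19944 for missed=2493.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: four 32-bit counters, then an array of 64-bit values. */
struct data_container {
	uint32_t bytes_left;
	uint32_t bytes_missing;
	uint32_t elem_cnt;
	uint32_t elem_missed;
	uint64_t val[];
};

/* How many 64-bit values fit in a buffer of `buf_size` bytes. */
static size_t elems_that_fit(size_t buf_size)
{
	assert(buf_size >= sizeof(struct data_container));
	return (buf_size - sizeof(struct data_container)) / sizeof(uint64_t);
}

/* Extra bytes needed to hold `missed` values that did not fit. */
static size_t bytes_for_elems(size_t missed)
{
	return missed * sizeof(uint64_t);
}
```

Under the same assumption a 64k buffer holds 8190 values, comfortably more than the 510 + 2493 values the failing case needed.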
    • Btrfs: use helper for logical resolve · df031f07
      Liu Bo committed
      We already have a helper, iterate_inodes_from_logical(), for logical resolve,
      so just use it.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix a bug in parsing return value in logical resolve · 69917e43
      Liu Bo committed
      In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.
      
      It is possible to lose our errors because
      (-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.
      
      I'm not sure if it is on purpose, it just looks too hacky if it is.
      I'd rather use a real flag and a 'ret' to catch errors.
      Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Liu Bo <liub.liubo@gmail.com>
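Why a negative errno can pass a flag test is easy to show in isolation. In the sketch below the flag value `(1ULL << 1)` is an assumption made for illustration, and `looks_like_tree_block()` and `check_extent()` are hypothetical helpers contrasting the overloaded return value with a separate `ret` plus flags word.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value for illustration; the real definition is in the btrfs headers. */
#define EXTENT_FLAG_TREE_BLOCK (1ULL << 1)

/* Buggy pattern: the return value doubles as a flag word. */
static int looks_like_tree_block(int ret)
{
	/* A negative errno is sign-extended, so many of its bits are set. */
	return (ret & EXTENT_FLAG_TREE_BLOCK) != 0;
}

/* Safer pattern: errors travel in `ret`, flags in their own variable. */
static int check_extent(int ret, uint64_t flags, int *is_tree_block)
{
	assert(is_tree_block != NULL);
	if (ret < 0)
		return ret;	/* propagate the error untouched */
	*is_tree_block = (flags & EXTENT_FLAG_TREE_BLOCK) != 0;
	return 0;
}
```

On Linux both -EIO and -EINVAL have bit 1 set in two's complement, so the buggy test treats them as a tree block instead of reporting the error.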
    • Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag · 9e8a4a8b
      Liu Bo committed
      We're going to use this flag EXTENT_DEFRAG to indicate which ranges
      belong to defragmentation so that we can implement snapshot-aware defrag:
      
      We set the EXTENT_DEFRAG flag when dirtying the extents that need to be
      defragmented, so later on the writeback thread can differentiate between
      normal writeback and writeback started by defragmentation.
      Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix wrong size for the reservation of the, snapshot creation · 48c03c4b
      Miao Xie committed
      We should insert/update 6 items (root ref, root backref, dir item, dir index,
      root item and parent inode) when creating a snapshot, not 5. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: add a new "type" field into the block reservation structure · 66d8f3dd
      Miao Xie committed
      Sometimes we need to choose the reservation method according to the type
      of the block reservation, such as the reservation for the delayed inode update.
      Currently we identify the type just by comparing the address of the reservation
      variants, which is very ugly for a temporary one because we need to compare it
      with all the common reservation variants. So we add a new "type" field to record
      the type of the reservation variants.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
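The idea of tagging a reservation with its type, instead of comparing its address against every well-known instance, can be sketched like this. The enum values, struct fields and helper names are hypothetical and do not reproduce the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical set of reservation kinds. */
enum rsv_type {
	RSV_TYPE_GLOBAL,
	RSV_TYPE_DELAYED,
	RSV_TYPE_TEMP,
};

struct block_rsv {
	uint64_t size;
	uint64_t reserved;
	enum rsv_type type;	/* recorded once, when the reservation is set up */
};

static void block_rsv_init(struct block_rsv *rsv, enum rsv_type type)
{
	assert(rsv != NULL);
	rsv->size = 0;
	rsv->reserved = 0;
	rsv->type = type;
}

/* One comparison, however many common reservations exist. */
static bool block_rsv_is_temp(const struct block_rsv *rsv)
{
	assert(rsv != NULL);
	return rsv->type == RSV_TYPE_TEMP;
}
```

Without the field, recognizing a temporary reservation means ruling out the address of each common reservation in turn.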
    • Btrfs: remove unused hint byte argument for btrfs_drop_extents · 2671485d
      Josef Bacik committed
      I audited all users of btrfs_drop_extents and found that nobody actually uses
      the hint_byte argument.  I'm sure it was used for something at some point but
      it's not used now, and the way the pinning works the disk bytenr would never be
      immediately useful anyway so let's just remove it.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: turbo charge fsync · 5dc562c5
      Josef Bacik committed
      At least for the vm workload.  Currently on fsync we will
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
      The problem with this is that for things like VMs you can have lots of
      extents from the fragmented writing behavior, and worse yet you may have
      only modified a few extents, not the entire thing.  This patch fixes this
      problem by tracking which transid modified our extent, and then when we do
      the tree logging we find all of the extents we've modified in our current
      transaction, sort them and commit them.  We also only truncate up to the
      xattrs of the inode and copy that stuff in normally, and then just drop any
      extents in the range we have that exist in the log already.  Here are some
      numbers of a 50 meg fio job that does random writes and fsync()s after every
      write
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
      So around 2-6 times faster depending on your hardware.  There are a few
      corner cases, for example if you truncate at all we have to do it the old
      way since there is no way to be sure what is in the log is ok.  This
      probably could be done smarter, but if you write-fsync-truncate-write-fsync
      you deserve what you get.  All this work is in RAM of course so if your
      inode gets evicted from cache and you read it in and fsync it we'll do it
      the slow way if we are still in the same transaction that we last modified
      the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
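The core selection step described in this message, picking only the extents touched by the current transaction and sorting them before logging, might look roughly like the following. It is a userspace sketch: `struct extent`, `collect_modified()` and the choice of file offset as the sort key are assumptions, and the real patch works on in-memory extent maps rather than a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory record of one file extent. */
struct extent {
	uint64_t start;		/* file offset */
	uint64_t len;
	uint64_t generation;	/* transid that last modified this extent */
};

/* Order extents by file offset (assumed sort key). */
static int cmp_by_start(const void *a, const void *b)
{
	const struct extent *x = a, *y = b;
	return (x->start > y->start) - (x->start < y->start);
}

/*
 * Copy into `out` the extents modified in transaction `transid`,
 * sorted by start offset. `out` must have room for `n` entries.
 * Returns the number of extents that need to be logged.
 */
static size_t collect_modified(const struct extent *all, size_t n,
			       uint64_t transid, struct extent *out)
{
	size_t count = 0;

	assert(all != NULL && out != NULL);
	for (size_t i = 0; i < n; i++)
		if (all[i].generation == transid)
			out[count++] = all[i];
	qsort(out, count, sizeof(*out), cmp_by_start);
	return count;
}
```

For a heavily fragmented VM image where only a few extents changed, the amount of logging then scales with the extents modified in the transaction rather than with the total extent count.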
  11. 27 Sep, 2012 3 commits
  12. 21 Sep, 2012 1 commit