1. 12 11月, 2013 12 次提交
    • J
      Btrfs: return an error from btrfs_wait_ordered_range · 0ef8b726
      Josef Bacik 提交于
      I noticed that if the free space cache has an error writing out it's data it
      won't actually error out, it will just carry on.  This is because it doesn't
      check the return value of btrfs_wait_ordered_range, which didn't actually return
      anything.  So fix this in order to keep us from making free space cache look
      valid when it really isnt.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      0ef8b726
    • Z
      btrfs: remove fs/btrfs/compat.h · 8b558c5f
      Zach Brown 提交于
      fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
      and inc_nlink().  This doesn't belong in mainline.
      Signed-off-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      8b558c5f
    • J
      Btrfs: handle a missing extent for the first file extent · 25a50341
      Josef Bacik 提交于
      While trying to kill our hole extents I noticed I was seeing problems where we
      seek into a file and then start writing and then try to fiemap that file later.
      This is because we search for offset 0, don't find anything and so back up one
      slot, which puts us at the inode ref or something like that, which means we goto
      not_found and create an extent map for our entire search area.  This isn't quite
      what we want, we want to move forward one slot and see if there is an extent
      there so we can limit our hole extent.  This patch fixes this problem, I will
      add a testcase for this as well.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      25a50341
    • J
      Btrfs: add tests for btrfs_get_extent · aaedb55b
      Josef Bacik 提交于
      I'm going to be removing hole extents in the near future so I wanted to make a
      sanity test for btrfs_get_extent to make sure I don't break anything in the
      meantime.  This patch just puts btrfs_get_extent through its paces by giving it
      a completely unreasonable mapping to look at and make sure it is giving us back
      maps that make sense.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      aaedb55b
    • J
      Btrfs: free reserved space on error in a few places · 857cc2fc
      Josef Bacik 提交于
      While trying to track down a reserved space leak I noticed a few places where we
      won't properly clean up reserved space if we have an error, this patch fixes
      those up.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      857cc2fc
    • F
      Btrfs: improve inode hash function/inode lookup · 778ba82b
      Filipe David Borba Manana 提交于
      Currently the hash value used for adding an inode to the VFS's inode
      hash table consists of the plain inode number, which is a 64 bits
      integer. This results in hash table buckets (hlist_head lists) with
      too many elements for at least 2 important scenarios:
      
      1) When we have many subvolumes. Each subvolume has its own btree
         where its files and directories are added to, and each has its
         own objectid (inode number) namespace. This means that if we have
         N subvolumes, and all have inode number X associated to a file or
         directory, the corresponding inodes all map to the same hash table
         entry, resulting in a bucket (hlist_head list) with N elements;
      
      2) On 32 bits machines. Th VFS hash values are unsigned longs, which
         are 32 bits wide on 32 bits machines, and the inode (objectid)
         numbers are 64 bits unsigned integers. We simply cast the inode
         numbers to hash values, which means that for all inodes with the
         same 32 bits lower half, the same hash bucket is used for all of
         them. For example, all inodes with a number (objectid) between
         0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in
         the same hash table bucket.
      
      This change ensures the inode's hash value depends both on the
      objectid (inode number) and its subvolume's (btree root) objectid.
      For 32 bits machines, this change gives better entropy by making
      the hash value depend on both the upper and lower 32 bits of the
      64 bits hash previously computed.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      778ba82b
    • M
      Btrfs: improve jitter performance of the sequential buffered write · fa7c1494
      Miao Xie 提交于
      The performance was slowed down sometimes when we ran sysbench to measure
      the performance of the sequential buffered write by 2 or more threads.
      
      It was because the write order of the test threads might be confused
      by the task scheduler, and the coming write would be beyond the end of
      the file, in this case, we need insert dummy file extents and create
      a hole for the area we skip. But in order to avoid the ongoing ordered
      extents which are in the area, we need wait for them. Unfortunately,
      the current code doesn't check if there are ordered extents in the area
      or not, try to find and flush the dirty pages directly, but in fact,
      there is no dirty page in that area, this step of the current code is
      unnecessary, and just wastes time. Sometimes, it would increase
      the contention of some locks, and makes the performance slow down suddenly.
      
      So we remove the ordered extent flush function before the check, and flush
      the dirty pages and wait for the ordered extents only when we find them.
      
      According to my test, we got 1-2 times of the performance regression when
      we ran the test by 10 times before applying this patch. After applying
      this patch, the regression went away.
      
      Test Environment:
       CPU:		1CPU * 4Cores
       Memory:	6GB
       Partition:	20GB
      
      Test Command:
       # sysbench --test=fileio --file-total-size=16G --file-test-mode=seqwr \
       > --num-threads=512 --file-block-size=16384 --max-time=60 --max-requests=0 run
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      fa7c1494
    • J
      Btrfs: do not release metadata for space cache inodes · b6d08f06
      Josef Bacik 提交于
      I've been testing our error paths and I was tripping the BUG_ON() in
      drop_outstanding_extent because our outstanding_extents is 0 for space cache
      inodes.  This is because we don't reserve metadata space for these inodes since
      we depend on the global block reserve for our space.  To fix this we need to
      make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for
      space cache inodes.  With this patch I'm no longer panicing.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      b6d08f06
    • F
      Btrfs: fix tracking of orphan inode count · 703c88e0
      Filipe David Borba Manana 提交于
      In inode.c:btrfs_orphan_add() if we failed to insert the orphan
      item, we would return without decrementing the orphan count that
      we just incremented before attempting the insertion, leaving the
      orphan inode count wrong.
      
      In inode.c:btrfs_orphan_del(), we were decrementing the inode
      orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set,
      which is logically wrong because it should be decremented if the
      bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment
      the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      703c88e0
    • R
      btrfs: drop unused parameter from btrfs_item_nr · dd3cc16b
      Ross Kirk 提交于
      Remove unused eb parameter from btrfs_item_nr
      Signed-off-by: NRoss Kirk <ross.kirk@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      dd3cc16b
    • F
      Btrfs: don't store NULL byte in symlink extents · f06becc4
      Filipe David Borba Manana 提交于
      It is not necessary to store the NULL byte in a symlink inline file
      extent. There's currently no code that requires the NULL byte to be
      present in the extent. This change also doesn't break file format
      compatibility nor the send/receive feature.
      
      The VFS also doesn't need the NULL byte to be present in the extent,
      as it reads up to inode->i_size bytes (which already excluded the NULL
      byte) and sets the NULL byte for us (in fs/namei.c:page_getlink()).
      
      So with this change we save 1 byte per symlink file extent (which is
      always inlined in the btree leaf) without losing backward and forward
      compatibility.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      f06becc4
    • S
      Btrfs: eliminate the exceptional root_tree refs=0 · 69e9c6c6
      Stefan Behrens 提交于
      The fact that btrfs_root_refs() returned 0 for the tree_root caused
      bugs in the past, therefore it is set to 1 with this patch and
      (hopefully) all affected code is adapted to this change.
      
      I verified this change by temporarily adding WARN_ON() checks
      everywhere where btrfs_root_refs() is used, checking whether the
      logic of the code is changed by btrfs_root_refs() returning 1
      instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID.
      With these added checks, I ran the xfstests './check -g auto'.
      
      The two roots chunk_root and log_root_tree that are only referenced
      by the superblock and the log_roots below the log_root_tree still
      have btrfs_root_refs() == 0, only the tree_root is changed.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      69e9c6c6
  2. 19 10月, 2013 1 次提交
  3. 11 10月, 2013 1 次提交
  4. 21 9月, 2013 3 次提交
  5. 13 9月, 2013 1 次提交
  6. 01 9月, 2013 11 次提交
  7. 10 8月, 2013 3 次提交
    • Z
      btrfs: don't loop on large offsets in readdir · db62efbb
      Zach Brown 提交于
      When btrfs readdir() hits the last entry it sets the readdir offset to a
      huge value to stop buggy apps from breaking when the same name is
      returned by readdir() with concurrent rename()s.
      
      But unconditionally setting the offset to INT_MAX causes readdir() to
      loop returning any entries with offsets past INT_MAX.  It only takes a
      few hours of constant file creation and removal to create entries past
      INT_MAX.
      
      So let's set the huge offset to LLONG_MAX if the last entry has already
      overflowed 32bit loff_t.   Without large offsets behaviour is identical.
      With large offsets 64bit apps will work and 32bit apps will be no more
      broken than they currently are if they see large offsets.
      Signed-off-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      db62efbb
    • L
      Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents · e68afa49
      Liu Bo 提交于
      For partial extents, snapshot-aware defrag does not work as expected,
      since
      a) we use the wrong logical offset to search for parents, which should be
         disk_bytenr + extent_offset, not just disk_bytenr,
      b) 'offset' returned by the backref walking just refers to key.offset, not
         the 'offset' stored in btrfs_extent_data_ref which is
         (key.offset - extent_offset).
      
      The reproducer:
      $ mkfs.btrfs sda
      $ mount sda /mnt
      $ btrfs sub create /mnt/sub
      $ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
      $ btrfs sub snap /mnt/sub /mnt/snap1
      $ btrfs sub snap /mnt/sub /mnt/snap2
      $ sync; btrfs filesystem defrag /mnt/sub/foo;
      $ umount /mnt
      $ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-awared.
      
      This addresses the above two problems.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      e68afa49
    • J
      btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified · 7cddc193
      Jie Liu 提交于
      Create a small file and fallocate it to a big size with
      FALLOC_FL_KEEP_SIZE option, then truncate it back to the
      small size again, the disk free space is not changed back
      in this case. i.e,
      
      total 4
      -rw-r--r-- 1 root root 512 Jun 28 11:35 test
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G   56K  7.2G   1% /mnt
      
      -rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G  5.1G  2.2G  70% /mnt
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G  5.1G  2.2G  70% /mnt
      
      With this fix, the truncated up space is back as:
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G   56K  7.2G   1% /mnt
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      7cddc193
  8. 15 7月, 2013 1 次提交
  9. 02 7月, 2013 4 次提交
    • J
      Btrfs: wait ordered range before doing direct io · 0e267c44
      Josef Bacik 提交于
      My recent truncate patch uncovered this bug, but I can reproduce it without the
      truncate patch.  If you mount with -o compress-force, do a direct write to some
      area, do a buffered write to some other area, and then do a direct read you will
      get the wrong data for where you did the buffered write.  This is because the
      generic direct io helpers only call filemap_write_and_wait once, and for
      compression we need it twice.  So to be safe add the btrfs_wait_ordered_range to
      the start of the direct io function to make sure any compressed writes have
      truly been written.  This patch makes xfstests 130 pass when you mount with -o
      compress-force=lzo.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      0e267c44
    • M
      e6da5d2e
    • J
      Btrfs: check if we can nocow if we don't have data space · 7ee9e440
      Josef Bacik 提交于
      We always just try and reserve data space when we write, but if we are out of
      space but have prealloc'ed extents we should still successfully write.  This
      patch will try and see if we can write to prealloc'ed space and if we can go
      ahead and allow the write to continue.  With this patch we now pass xfstests
      generic/274.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      7ee9e440
    • J
      Btrfs: check for actual acls rather than just xattrs when caching no acl · f23b5a59
      Josef Bacik 提交于
      We have an optimization that will go ahead and cache no acls on an inode if
      there are no xattrs on the inode.  This saves us a lookup later to check the
      acls for writes or any other access.  The problem is I use selinux so I always
      have an xattr on inodes, so make this test a little smarter and check for the
      actual acl hash on the key and if it isn't there then we still get to cache no
      acl which makes everybody who uses selinux a little happier.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      f23b5a59
  10. 01 7月, 2013 2 次提交
    • J
      Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate · a71754fc
      Josef Bacik 提交于
      This has plagued us forever and I'm so over working around it.  When we truncate
      down to a non-page aligned offset we will call btrfs_truncate_page to zero out
      the end of the page and write it back to disk, this will keep us from exposing
      stale data if we truncate back up from that point.  The problem with this is it
      requires data space to do this, and people don't really expect to get ENOSPC
      from truncate() for these sort of things.  This also tends to bite the orphan
      cleanup stuff too which keeps people from mounting.  To get around this we can
      just move this into btrfs_cont_expand() to make sure if we are truncating up
      from a non-page size aligned i_size we will zero out the rest of this page so
      that we don't expose stale data.  This will give ENOSPC if you try to truncate()
      up or if you try to write past the end of isize, which is much more reasonable.
      This fixes xfstests generic/083 failing to mount because of the orphan cleanup
      failing.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      a71754fc
    • J
      Btrfs: unlock extent range on enospc in compressed submit · fdf8e2ea
      Josef Bacik 提交于
      A user reported a deadlock where the async submit thread was blocked on the
      lock_extent() lock, and then everybody behind him was locked on the page lock
      for the page he was holding.  Looking at the code I noticed we do not unlock the
      extent range when we get ENOSPC and goto retry.  This is bad because we
      immediately try to lock that range again to do the cow, which will cause a
      deadlock.  Fix this by unlocking the range.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      fdf8e2ea
  11. 29 6月, 2013 1 次提交