1. 01 7月, 2013 2 次提交
    • J
      Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate · a71754fc
      Josef Bacik 提交于
      This has plagued us forever and I'm so over working around it.  When we truncate
      down to a non-page aligned offset we will call btrfs_truncate_page to zero out
      the end of the page and write it back to disk, this will keep us from exposing
      stale data if we truncate back up from that point.  The problem with this is it
      requires data space to do this, and people don't really expect to get ENOSPC
      from truncate() for these sort of things.  This also tends to bite the orphan
      cleanup stuff too which keeps people from mounting.  To get around this we can
      just move this into btrfs_cont_expand() to make sure if we are truncating up
      from a non-page size aligned i_size we will zero out the rest of this page so
      that we don't expose stale data.  This will give ENOSPC if you try to truncate()
      up or if you try to write past the end of isize, which is much more reasonable.
      This fixes xfstests generic/083 failing to mount because of the orphan cleanup
      failing.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      a71754fc
    • J
      Btrfs: unlock extent range on enospc in compressed submit · fdf8e2ea
      Josef Bacik 提交于
      A user reported a deadlock where the async submit thread was blocked on the
      lock_extent() lock, and then everybody behind him was locked on the page lock
      for the page he was holding.  Looking at the code I noticed we do not unlock the
      extent range when we get ENOSPC and goto retry.  This is bad because we
      immediately try to lock that range again to do the cow, which will cause a
      deadlock.  Fix this by unlocking the range.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      fdf8e2ea
  2. 14 6月, 2013 6 次提交
  3. 09 6月, 2013 1 次提交
  4. 18 5月, 2013 4 次提交
  5. 07 5月, 2013 12 次提交
    • D
      41074888
    • E
      btrfs: make static code static & remove dead code · 48a3b636
      Eric Sandeen 提交于
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      48a3b636
    • L
      Btrfs: return free space in cow error path · ace68bac
      Liu Bo 提交于
      Replace some BUG_ONs with proper handling and take allocated space back to
      free space cache for later use.
      
      We don't have to worry about extent maps since they'd be freed in releasepage
      path.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ace68bac
    • J
      Btrfs: fix extent logging with O_DIRECT into prealloc · eb384b55
      Josef Bacik 提交于
      This is the same as the fix from commit
      
      Btrfs: fix bad extent logging
      
      but for O_DIRECT.  I missed this when I fixed the problem originally, we were
      still using the em for the orig_start and orig_block_len, which would be the
      merged extent.  We need to use the actual extent from the on disk file extent
      item, which we have to lookup to make sure it's ok to nocow anyway so just pass
      in some pointers to hold this info.  Thanks,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      eb384b55
    • T
      Btrfs: cleanup of function where fixup_low_keys() is called · afe5fea7
      Tsutomu Itoh 提交于
      If argument 'trans' is unnecessary in the function where
      fixup_low_keys() is called, 'trans' is deleted.
      Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      afe5fea7
    • Z
      btrfs: abort unlink trans in missed error case · d4e3991b
      Zach Brown 提交于
      __btrfs_unlink_inode() aborts its transaction when it sees errors after
      it removes the directory item.  But it missed the case where
      btrfs_del_dir_entries_in_log() returns an error.  If this happens then
      the unlink appears to fail but the items have been removed without
      updating the directory size.  The directory then has leaked bytes in
      i_size and can never be removed.
      
      Adding the missing transaction abort at least makes this failure
      consistent with the other failure cases.
      
      I noticed this while reading the code after someone on irc reported
      having a directory with i_size but no entries.  I tested it by forcing
      btrfs_del_dir_entries_in_log() to return -ENOMEM.
      Signed-off-by: NZach Brown <zab@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      d4e3991b
    • J
      Btrfs: fix bad extent logging · 09a2a8f9
      Josef Bacik 提交于
      A user sent me a btrfs-image of a file system that was panicing on mount during
      the log recovery.  I had originally thought these problems were from a bug in
      the free space cache code, but that was just a symptom of the problem.  The
      problem is if your application does something like this
      
      [prealloc][prealloc][prealloc]
      
      the internal extent maps will merge those all together into one extent map, even
      though on disk they are 3 separate extents.  So if you go to write into one of
      these ranges the extent map will be right since we use the physical extent when
      doing the write, but when we log the extents they will use the wrong sizes for
      the remainder prealloc space.  If this doesn't happen to trip up the free space
      cache (which it won't in a lot of cases) then you will get bogus entries in your
      extent tree which will screw stuff up later.  The data and such will still work,
      but everything else is broken.  This patch fixes this by not allowing extents
      that are on the modified list to be merged.  This has the side effect that we
      are no longer adding everything to the modified list all the time, which means
      we now have to call btrfs_drop_extents every time we log an extent into the
      tree.  So this allows me to drop all this speciality code I was using to get
      around calling btrfs_drop_extents.  With this patch the testcase I've created no
      longer creates a bogus file system after replaying the log.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      09a2a8f9
    • J
      Btrfs: log ram bytes properly · cc95bef6
      Josef Bacik 提交于
      When logging changed extents I was logging ram_bytes as the current length,
      which isn't correct, it's supposed to be the ram bytes of the original extent.
      This is for compression where even if we split the extent we need to know the
      ram bytes so when we uncompress the extent we know how big it will be.  This was
      still working out right with compression for some reason but I think we were
      getting lucky.  It was definitely off for prealloc which is why I noticed it,
      btrfsck was complaining about it.  With this patch btrfsck no longer complains
      after a log replay.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      cc95bef6
    • D
      btrfs: make orphan cleanup less verbose · 4884b476
      David Sterba 提交于
      The messages
      
        btrfs: unlinked 123 orphans
        btrfs: truncated 456 orphans
      
      are not useful to regular users and raise questions whether there are
      problems with the filesystem.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      4884b476
    • S
      Btrfs: Include the device in most error printk()s · c2cf52eb
      Simon Kirby 提交于
      With more than one btrfs volume mounted, it can be very difficult to find
      out which volume is hitting an error. btrfs_error() will print this, but
      it is currently rigged as more of a fatal error handler, while many of
      the printk()s are currently for debugging and yet-unhandled cases.
      
      This patch just changes the functions where the device information is
      already available. Some cases remain where the root or fs_info is not
      passed to the function emitting the error.
      
      This may introduce some confusion with volumes backed by multiple devices
      emitting errors referring to the primary device in the set instead of the
      one on which the error occurred.
      
      Use btrfs_printk(fs_info, format, ...) rather than writing the device
      string every time, and introduce macro wrappers ala XFS for brevity.
      Since the function already cannot be used for continuations, print a
      newline as part of the btrfs_printk() message rather than at each caller.
      Signed-off-by: NSimon Kirby <sim@hostway.ca>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      c2cf52eb
    • J
      Btrfs: add a incompatible format change for smaller metadata extent refs · 3173a18f
      Josef Bacik 提交于
      We currently store the first key of the tree block inside the reference for the
      tree block in the extent tree.  This takes up quite a bit of space.  Make a new
      key type for metadata which holds the level as the offset and completely removes
      storing the btrfs_tree_block_info inside the extent ref.  This reduces the size
      from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
      this results in a 30-35% decrease in the size of our extent tree, which means we
      COW less and can keep more of the extent tree in memory which makes our heavy
      metadata operations go much faster.  This is not an automatic format change, you
      must enable it at mkfs time or with btrfstune.  This patch deals with having
      metadata stored as either the old format or the new format so it is easy to
      convert.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      3173a18f
    • L
      Btrfs: cleanup unused arguments of btrfs_csum_data · b0496686
      Liu Bo 提交于
      Argument 'root' is no more used in btrfs_csum_data().
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      b0496686
  6. 28 3月, 2013 2 次提交
    • M
      Btrfs: fix wrong reservation of csums · 39847c4d
      Miao Xie 提交于
      We reserve the space for csums only when we write data into a file, in
      the other cases, such as tree log, log replay, we don't do reservation,
      so we can use the reservation of the transaction handle just for the former.
      And for the latter, we should use the tree's own reservation. But the
      function - btrfs_csum_file_blocks() didn't differentiate between these
      two types of the cases, fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      39847c4d
    • J
      Btrfs: fix space accounting for unlink and rename · 6e137ed3
      Josef Bacik 提交于
      We are way over-reserving for unlink and rename.  Rename is just some random
      huge number and unlink accounts for tree log operations that don't actually
      happen during unlink, not to mention the tree log doesn't take from the trans
      block rsv anyway so it's completely useless.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      6e137ed3
  7. 27 3月, 2013 1 次提交
    • C
      Btrfs: fix race between mmap writes and compression · 4adaa611
      Chris Mason 提交于
      Btrfs uses page_mkwrite to ensure stable pages during
      crc calculations and mmap workloads.  We call clear_page_dirty_for_io
      before we do any crcs, and this forces any application with the file
      mapped to wait for the crc to finish before it is allowed to change
      the file.
      
      With compression on, the clear_page_dirty_for_io step is happening after
      we've compressed the pages.  This means the applications might be
      changing the pages while we are compressing them, and some of those
      modifications might not hit the disk.
      
      This commit adds the clear_page_dirty_for_io before compression starts
      and makes sure to redirty the page if we have to fallback to
      uncompressed IO as well.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      Reported-by: NAlexandre Oliva <oliva@gnu.org>
      cc: stable@vger.kernel.org
      4adaa611
  8. 15 3月, 2013 1 次提交
  9. 06 3月, 2013 1 次提交
  10. 01 3月, 2013 1 次提交
    • J
      Btrfs: copy everything if we've created an inline extent · bdc20e67
      Josef Bacik 提交于
      I noticed while looking into a tree logging bug that we aren't logging inline
      extents properly.  Since this requires copying and it shouldn't happen too often
      just force us to copy everything for the inode into the tree log when we have an
      inline extent.  With this patch we have valid data after a crash when we write
      an inline extent.  Thanks,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      bdc20e67
  11. 27 2月, 2013 2 次提交
    • Q
      btrfs: cleanup for open-coded alignment · fda2832f
      Qu Wenruo 提交于
      Though most of the btrfs codes are using ALIGN macro for page alignment,
      there are still some codes using open-coded alignment like the
      following:
      ------
              u64 mask = ((u64)root->stripesize - 1);
              u64 ret = (val + mask) & ~mask;
      ------
      Or even hidden one:
      ------
              num_bytes = (end - start + blocksize) & ~(blocksize - 1);
      ------
      
      Sometimes these open-coded alignment is not so easy to understand for
      newbie like me.
      
      This commit changes the open-coded alignment to the ALIGN macro for a
      better readability.
      
      Also there is a previous patch from David Sterba with similar changes,
      but the patch is for 3.2 kernel and seems not merged.
      http://www.spinics.net/lists/linux-btrfs/msg12747.html
      
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      fda2832f
    • L
      Btrfs: do not change inode flags in rename · 8c4ce81e
      Liu Bo 提交于
      Before we forced to change a file's NOCOW and COMPRESS flag due to
      the parent directory's, but this ends up a bad idea, because it
      confuses end users a lot about file's NOCOW status, eg. if someone
      change a file to NOCOW via 'chattr' and then rename it in the current
      directory which is without NOCOW attribute, the file will lose the
      NOCOW flag silently.
      
      This diables 'change flags in rename', so from now on we'll only
      inherit flags from the parent directory on creation stage while in
      other places we can use 'chattr' to set NOCOW or COMPRESS flags.
      Reported-by: NMarios Titas <redneb8888@gmail.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      8c4ce81e
  12. 26 2月, 2013 1 次提交
    • J
      Btrfs: make sure NODATACOW also gets NODATASUM set · f2bdf9a8
      Josef Bacik 提交于
      A user reported hitting the BUG_ON() in btrfs_finished_ordered_io() where we had
      csums on a NOCOW extent.  This can happen if we have NODATACOW set but not
      NODATASUM set, which can happen in two cases, either we mount with -o nodatacow
      and then write into preallocated space, or chattr +C a directory and move a file
      into that directory.  Liu has fixed the move case in a different place, but this
      fixes the mount -o nodatacow case.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      f2bdf9a8
  13. 23 2月, 2013 1 次提交
  14. 21 2月, 2013 5 次提交
    • M
      Btrfs: fix wrong outstanding_extents when doing DIO write · 172a5049
      Miao Xie 提交于
      When running the 083th case of xfstests on the filesystem with
      "compress-force=lzo", the following WARNINGs were triggered.
        WARNING: at fs/btrfs/inode.c:7908
        WARNING: at fs/btrfs/inode.c:7909
        WARNING: at fs/btrfs/inode.c:7911
        WARNING: at fs/btrfs/extent-tree.c:4510
        WARNING: at fs/btrfs/extent-tree.c:4511
      
      This problem was introduced by the patch "Btrfs: fix deadlock due
      to unsubmitted". In this patch, there are two bugs which caused
      the above problem.
      
      The 1st one is a off-by-one bug, if the DIO write return 0, it is
      also a short write, we need release the reserved space for it. But
      we didn't do it in that patch. Fix it by change "ret > 0" to
      "ret >= 0".
      
      The 2nd one is ->outstanding_extents was increased twice when
      a short write happened. As we know, ->outstanding_extents is
      a counter to keep track of the number of extent items we may
      use duo to delalloc, when we reserve the free space for a
      delalloc write, we assume that the write will introduce just
      one extent item, so we increase ->outstanding_extents by 1 at
      that time. And then we will increase it every time we split the
      write, it is done at the beginning of btrfs_get_blocks_direct().
      So when a short write happens, we needn't increase
      ->outstanding_extents again. But this patch done.
      
      In order to fix the 2nd problem, I re-write the logic for
      ->outstanding_extents operation. We don't increase it at the
      beginning of btrfs_get_blocks_direct(), instead, we just
      increase it when the split actually happens.
      Reported-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      172a5049
    • L
      Btrfs: snapshot-aware defrag · 38c227d8
      Liu Bo 提交于
      This comes from one of btrfs's project ideas,
      As we defragment files, we break any sharing from other snapshots.
      The balancing code will preserve the sharing, and defrag needs to grow this
      as well.
      
      Now we're able to fill the blank with this patch, in which we make full use of
      backref walking stuff.
      
      Here is the basic idea,
      o  set the writeback ranges started by defragment with flag EXTENT_DEFRAG
      o  at endio, after we finish updating fs tree, we use backref walking to find
         all parents of the ranges and re-link them with the new COWed file layout by
         adding corresponding backrefs.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      38c227d8
    • Z
      btrfs: limit fallocate extent reservation to 256MB · 24542bf7
      Zach Brown 提交于
      Very large fallocate requests are cpu bound and result in extents with a
      repeating pattern of ever decreasing size:
      
      $ time fallocate -l 1T file
      real	0m13.039s
      
      ( an excerpt of the extents from btrfs-debug-tree: )
        prealloc data disk byte 1536292564992 nr 397312
        prealloc data disk byte 1536292962304 nr 196608
        prealloc data disk byte 1536293158912 nr 98304
        prealloc data disk byte 1536293257216 nr 49152
        prealloc data disk byte 1536293306368 nr 24576
        prealloc data disk byte 1536293330944 nr 12288
        prealloc data disk byte 1536293343232 nr 8192
        prealloc data disk byte 1536293351424 nr 4096
        prealloc data disk byte 1536293355520 nr 4096
        prealloc data disk byte 1536293359616 nr 4096
      
      The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
      allocate the entire remaining size after each extent is allocated.
      btrfs_reserve_extent() repeatedly cuts this requested size in half until
      it gets down to the size that the allocators can return.  We limit the
      problem for now by capping each reservation at 256 meg.
      
      The small extents come from a masking bug when decreasing the requested
      reservation size.  The high 32bits are cleared and the remaining low
      bits might happen to reserve a small size.   Fix this by using
      round_down() which properly casts the mask.
      
      After these fixes huge fallocate requests are fast and result in nice
      large extents:
      
      $ time fallocate -l 1T file
      real	0m0.082s
      
        prealloc data disk byte 1112425889792 nr 268435456
        prealloc data disk byte 1112694325248 nr 268435456
        prealloc data disk byte 1112962760704 nr 268435456
      Reported-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      24542bf7
    • L
      Btrfs: fix cleaner thread not working with inode cache option · fa6ac876
      Liu Bo 提交于
      Right now inode cache inode is treated as the same as space cache
      inode, ie. keep inode in memory till putting super.
      
      But this leads to an awkward situation.
      
      If we're going to delete a snapshot/subvolume, btrfs will not
      actually delete it and return free space, but will add it to dead
      roots list until the last inode on this snap/subvol being destroyed.
      Then we'll fetch deleted roots and cleanup them via cleaner thread.
      
      So here is the problem, if we enable inode cache option, each
      snap/subvol has a cached inode which is used to store inode allcation
      information.  And this cache inode will be kept in memory, as the above
      said.  So with inode cache, snap/subvol can only be added into
      dead roots list during freeing roots stage in umount, so that we can
      ONLY get space back after another remount(we cleanup dead roots on mount).
      
      But the real thing is we'll no more use the snap/subvol if we mark it
      deleted, so we can safely iput its cache inode when we delete snap/subvol.
      
      Another thing is that we need to change the rules of droping inode, we
      don't keep snap/subvol's cache inode in memory till end so that we can
      add snap/subvol into dead roots list in time.
      Reported-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      fa6ac876
    • M
      Btrfs: implement unlocked dio write · 38851cc1
      Miao Xie 提交于
      This idea is from ext4. By this patch, we can make the dio write parallel,
      and improve the performance. But because we can not update isize without
      i_mutex, the unlocked dio write just can be done in front of the EOF.
      
      We needn't worry about the race between dio write and truncate, because the
      truncate need wait untill all the dio write end.
      
      And we also needn't worry about the race between dio write and punch hole,
      because we have extent lock to protect our operation.
      
      I ran fio to test the performance of this feature.
      
      == Hardware ==
      CPU: Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz
      Mem: 2GB
      SSD: Intel X25-M 120GB (Test Partition: 60GB)
      
      == config file ==
      [global]
      ioengine=psync
      direct=1
      bs=4k
      size=32G
      runtime=60
      directory=/mnt/btrfs/
      filename=testfile
      group_reporting
      thread
      
      [file1]
      numjobs=1 # 2 4
      rw=randwrite
      
      == result (KBps) ==
      write	1	2	4
      lock	24936	24738	24726
      nolock	24962	30866	32101
      
      == result (iops) ==
      write	1	2	4
      lock	6234	6184	6181
      nolock	6240	7716	8025
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      38851cc1