1. 05 Oct, 2013 1 commit
    • L
      Btrfs: fix crash of compressed writes · 385fe0be
      Liu Bo committed
      The crash [1] was found by xfstests generic/208 with "-o compress".
      It does not reproduce every time, but when it does, the kernel panics.
      
      The bug is quite interesting: it was introduced by a recent commit
      (573aecaf,
      Btrfs: actually limit the size of delalloc range).
      
      Btrfs implements delayed allocation, so during writeback we
      (1) get a page A and lock it,
      (2) search the state tree for delalloc bytes and lock all pages within the range,
      (3) process the delalloc range, including finding disk space, creating
          an ordered extent and so on,
      (4) submit the page A.
      
      This works well in normal cases, but in a racy case, e.g.
      buffered compressed writes together with aio-dio writes,
      we may sometimes fail to lock all pages in the 'delalloc' range,
      in which case we need to fall back to searching the state tree again with
      a smaller range limit (max_bytes = PAGE_CACHE_SIZE - offset).
      
      The mentioned commit has a side effect, that is, in the fallback case,
      we can find delalloc bytes before the index of the page we already have locked,
      so we're in the case of (delalloc_end <= *start) and return with (found > 0).
      
      The result is that the delalloc pages are not locked but ->writepage still
      processes them, and the crash happens.
      
      Fix this by treating the case as if nothing was found and returning to the
      caller, since the caller knows how to deal with it properly.
      
      [1]:
      ------------[ cut here ]------------
      kernel BUG at mm/page-writeback.c:2170!
      [...]
      CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G           O 3.11.0+ #8
      [...]
      RIP: 0010:[<ffffffff810f5093>]  [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
      [...]
      [ 4934.248731] Stack:
      [ 4934.248731]  ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a
      [ 4934.248731]  0000000000000000 0000000000000000 0000000000000fff 0000000000000620
      [ 4934.248731]  ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0
      [ 4934.248731] Call Trace:
      [ 4934.248731]  [<ffffffffa02b841a>] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
      [ 4934.248731]  [<ffffffffa02a8889>] compress_file_range+0x1dc/0x4cb [btrfs]
      [ 4934.248731]  [<ffffffff8104f7af>] ? detach_if_pending+0x22/0x4b
      [ 4934.248731]  [<ffffffffa02a8bad>] async_cow_start+0x35/0x53 [btrfs]
      [ 4934.248731]  [<ffffffffa02c694b>] worker_loop+0x14b/0x48c [btrfs]
      [ 4934.248731]  [<ffffffffa02c6800>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
      [ 4934.248731]  [<ffffffff810608f5>] kthread+0x8d/0x95
      [ 4934.248731]  [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
      [ 4934.248731]  [<ffffffff814fe09c>] ret_from_fork+0x7c/0xb0
      [ 4934.248731]  [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
      [ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
      [ 4934.248731] RIP  [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
      [ 4934.248731]  RSP <ffff8801869f9c48>
      [ 4934.280307] ---[ end trace 36f06d3f8750236a ]---
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      385fe0be
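The check described in commit 385fe0be can be illustrated with a small userspace sketch. This is not the kernel code: `struct delalloc_search` and `usable_delalloc()` are hypothetical names, and the sketch only models the decision that a range ending at or before the already-locked page counts as "nothing found". It assumes `delalloc_end` is inclusive and that `*start` in the commit message corresponds to the file offset of the locked page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the result of searching the state tree. */
struct delalloc_search {
    uint64_t found;          /* delalloc bytes found, 0 if none */
    uint64_t delalloc_start; /* first byte of the range found */
    uint64_t delalloc_end;   /* last byte of the range found (inclusive) */
};

/*
 * Return how many delalloc bytes the caller may go on to process.
 * locked_page_start is the file offset of the page the caller already holds.
 */
uint64_t usable_delalloc(const struct delalloc_search *res,
                         uint64_t locked_page_start)
{
    assert(res != NULL);

    if (res->found == 0)
        return 0;

    /*
     * The range ends at or before the locked page, so none of its pages
     * were locked on behalf of the caller. Report "nothing found" and let
     * the caller handle it.
     */
    if (res->delalloc_end <= locked_page_start)
        return 0;

    return res->found;
}
```

With this shape, a fallback search that only turns up bytes before the locked page no longer reports `found > 0`.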
  2. 21 Sep, 2013 1 commit
    • J
      Btrfs: actually limit the size of delalloc range · 573aecaf
      Josef Bacik committed
      So forever we have had this thing to limit the amount of delalloc pages we'll
      set up to be written out to 128MB.  This is because we have to lock all the pages
      in this range, so anything above this gets a bit unwieldy, and also without a
      limit we'll happily allocate gigantic chunks of disk space.  Turns out our check
      for this wasn't quite right: we wouldn't actually limit the chunk we wanted to
      write out, we'd just stop looking for more space after we went over the limit.
      So if you do a giant 20GB dd on my box with lots of RAM I could get 2GB
      extents.  This is fine normally, except when you go to relocate these extents
      and we can't find enough space to relocate these monster extents, since we have
      to be able to allocate exactly the same sized extent to move it around.  So fix
      this by actually enforcing the limit.  With this patch I'm no longer seeing
      giant 1.5GB extents.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      573aecaf
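The difference between "stop looking after the limit" and "actually enforce the limit" in commit 573aecaf can be sketched as a clamp on an inclusive byte range. The helper name `clamp_delalloc_end()` and the constant `TOY_MAX_DELALLOC_BYTES` are assumptions made for this illustration; the real change lives in the kernel's delalloc range search and is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* 128 MiB, the limit mentioned in the commit message. */
#define TOY_MAX_DELALLOC_BYTES (UINT64_C(128) * 1024 * 1024)

/*
 * Given an inclusive range [start, end], return the end that keeps the
 * range no longer than max_bytes. Ranges already within the limit are
 * returned unchanged.
 */
uint64_t clamp_delalloc_end(uint64_t start, uint64_t end, uint64_t max_bytes)
{
    assert(end >= start);
    assert(max_bytes > 0);

    /* Length is (end - start + 1); compare without risking overflow. */
    if (end - start >= max_bytes)
        return start + max_bytes - 1;

    return end;
}
```

For example, a 2 GiB range starting at offset 0 would be cut back so that it ends at byte `TOY_MAX_DELALLOC_BYTES - 1`.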
  3. 01 Sep, 2013 14 commits
  4. 10 Aug, 2013 1 commit
    • J
      Btrfs: do not offset physical if we're compressed · b76bb701
      Josef Bacik committed
      xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
      offsetting the physical based on the extent offset.  This is perfectly fine with
      uncompressed extents, however the extent offset is into the uncompressed area,
      not the compressed.  So we can return a physical value that isn't at all within
      the area we have allocated on disk.  Fix this by returning the start of the
      extent if it is compressed no matter what the offset.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      b76bb701
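The rule stated in commit b76bb701 (offset the physical address only for uncompressed extents) can be sketched as follows. `struct toy_extent` and `fiemap_physical()` are hypothetical names for this illustration and do not correspond to the kernel's extent map structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified description of one file extent. */
struct toy_extent {
    uint64_t file_start; /* logical offset of the extent in the file */
    uint64_t disk_start; /* where the extent starts on disk */
    bool compressed;     /* true if the on-disk data is compressed */
};

/*
 * Physical address to report for file_offset inside the extent.
 * For compressed extents the offset refers to uncompressed data, so it
 * cannot be applied to the on-disk location; report the extent start.
 */
uint64_t fiemap_physical(const struct toy_extent *em, uint64_t file_offset)
{
    assert(em != NULL);
    assert(file_offset >= em->file_start);

    uint64_t offset_in_extent =
        em->compressed ? 0 : file_offset - em->file_start;

    return em->disk_start + offset_in_extent;
}
```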
  5. 02 Jul, 2013 1 commit
    • J
      Btrfs: check if we can nocow if we don't have data space · 7ee9e440
      Josef Bacik committed
      We always just try and reserve data space when we write, but if we are out of
      space but have prealloc'ed extents we should still successfully write.  This
      patch will try and see if we can write to prealloc'ed space and if we can go
      ahead and allow the write to continue.  With this patch we now pass xfstests
      generic/274.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      7ee9e440
  6. 01 Jul, 2013 1 commit
    • J
      Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate · a71754fc
      Josef Bacik committed
      This has plagued us forever and I'm so over working around it.  When we truncate
      down to a non-page aligned offset we will call btrfs_truncate_page to zero out
      the end of the page and write it back to disk, this will keep us from exposing
      stale data if we truncate back up from that point.  The problem with this is it
      requires data space to do this, and people don't really expect to get ENOSPC
      from truncate() for these sort of things.  This also tends to bite the orphan
      cleanup stuff too which keeps people from mounting.  To get around this we can
      just move this into btrfs_cont_expand() to make sure if we are truncating up
      from a non-page size aligned i_size we will zero out the rest of this page so
      that we don't expose stale data.  This will give ENOSPC if you try to truncate()
      up or if you try to write past the end of i_size, which is much more reasonable.
      This fixes xfstests generic/083 failing to mount because of the orphan cleanup
      failing.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      a71754fc
  7. 14 Jun, 2013 1 commit
  8. 22 May, 2013 1 commit
    • L
      mm: change invalidatepage prototype to accept length · d47992f8
      Lukas Czerner committed
      Currently there is no way to truncate a partial page where the end
      truncate point is not at the end of the page. This is because it was not
      needed and the functionality was enough for the file system truncate
      operation to work properly. However, more file systems now support the punch
      hole feature, and it can benefit from mm supporting truncating a page just
      up to a certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so it supports truncating a partial page at the end of the
      range (currently it will BUG_ON() if 'end' is not at the end of the
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept the range to be invalidated and updates all
      instances of it.
      
      We also change block_invalidatepage() in the same way and actually
      make use of the new length argument by implementing range invalidation.
      
      Actual file system implementations will follow, except for the file systems
      where the changes are really simple and should not change the behaviour
      in any way. An implementation of truncate_page_range(), which will be able
      to accept page-unaligned ranges, will follow as well.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      d47992f8
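What "accepting a range" means for an invalidation callback (commit d47992f8) can be modeled in userspace. The sketch below is a toy, not kernel code: `struct toy_page`, the block-validity flags and `toy_invalidatepage()` are all hypothetical, and the choice to drop only blocks that lie fully inside the range is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE  4096u
#define TOY_BLOCK_SIZE 1024u
#define TOY_BLOCKS     (TOY_PAGE_SIZE / TOY_BLOCK_SIZE)

/* Toy page that tracks which of its blocks still hold valid data. */
struct toy_page {
    bool block_valid[TOY_BLOCKS];
};

/*
 * Invalidate the byte range [offset, offset + length) of the page.
 * Only blocks entirely covered by the range are marked invalid.
 */
void toy_invalidatepage(struct toy_page *page, unsigned int offset,
                        unsigned int length)
{
    unsigned int stop = offset + length;

    assert(page != NULL);
    /* The range must stay inside the page and must not wrap around. */
    assert(stop <= TOY_PAGE_SIZE && stop >= length);

    for (unsigned int i = 0; i < TOY_BLOCKS; i++) {
        unsigned int blk_start = i * TOY_BLOCK_SIZE;
        unsigned int blk_end = blk_start + TOY_BLOCK_SIZE;

        if (blk_start >= offset && blk_end <= stop)
            page->block_valid[i] = false;
    }
}
```

Calling `toy_invalidatepage(&page, 0, TOY_PAGE_SIZE)` reproduces the whole-page behaviour, while a shorter length leaves the blocks past the range untouched.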
  9. 18 May, 2013 3 commits
    • C
      Btrfs: use a btrfs bioset instead of abusing bio internals · 9be3395b
      Chris Mason committed
      Btrfs has been pointer tagging bi_private and using bi_bdev
      to store the stripe index and mirror number of failed IOs.
      
      As bios bubble back up through the call chain, we use these
      to decide if and how to retry our IOs.  They are also used
      to count IO failures on a per device basis.
      
      A recently added bio tracepoint led to crashes because
      we were abusing bi_bdev.
      
      This commit adds a btrfs bioset, and creates explicit fields
      for the mirror number and stripe index.  The plan is to
      extend this structure for all of the fields currently in
      struct btrfs_bio, which will mean one less kmalloc in
      our IO path.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Reported-by: Tejun Heo <tj@kernel.org>
      9be3395b
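The idea in commit 9be3395b of replacing pointer tagging with explicit fields can be sketched with an embedding struct. All names below (`struct toy_bio`, `struct toy_btrfs_io_bio`, `toy_io_bio_from_bio()`) are hypothetical; the sketch assumes the generic bio is embedded in a larger per-filesystem structure and recovered with the usual container-of arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the generic block layer bio. */
struct toy_bio {
    void *private_data;
};

/*
 * Filesystem-specific wrapper: the extra state has its own fields
 * instead of being packed into spare bits of the generic bio.
 */
struct toy_btrfs_io_bio {
    unsigned long mirror_num;
    unsigned long stripe_index;
    struct toy_bio bio; /* embedded generic bio */
};

/* Recover the wrapper from a pointer to the embedded bio. */
struct toy_btrfs_io_bio *toy_io_bio_from_bio(struct toy_bio *bio)
{
    assert(bio != NULL);
    return (struct toy_btrfs_io_bio *)
        ((char *)bio - offsetof(struct toy_btrfs_io_bio, bio));
}
```

Code that only sees the generic bio on completion can then read `mirror_num` and `stripe_index` through the wrapper, provided the bio really was allocated as part of one.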
    • A
      btrfs: do away with non-whole_page extent I/O · 17a5adcc
      Alexandre Oliva committed
      end_bio_extent_readpage computes whole_page based on bv_offset and
      bv_len, without taking into account that blk_update_request may modify
      them when some of the blocks to be read into a page produce a read
      error.  This would cause the read to unlock only part of the file
      range associated with the page, which would in turn leave the entire
      page locked, which would not only keep the process blocked instead of
      returning -EIO to it, but also prevent any further access to the file.
      
      It turns out that btrfs always issues whole-page reads and writes.
      The special handling of non-whole_page appears to be a mistake or a
      left-over from a time when this wasn't the case.  Indeed,
      end_bio_extent_writepage distinguished between whole_page and
      non-whole_page writes but behaved identically in both cases!
      
      I've replaced the whole_page computations with warnings, just to be
      sure that we're not issuing partial page reads or writes.  The
      warnings should probably just go away some time.
      Signed-off-by: Alexandre Oliva <oliva@gnu.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      17a5adcc
    • L
      Btrfs: fix off-by-one in fiemap · a52f4cd2
      Liu Bo committed
      lock_extent/unlock_extent expect an exclusive end.
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      a52f4cd2
  10. 07 May, 2013 8 commits
  11. 27 Mar, 2013 1 commit
    • C
      Btrfs: fix race between mmap writes and compression · 4adaa611
      Chris Mason committed
      Btrfs uses page_mkwrite to ensure stable pages during
      crc calculations and mmap workloads.  We call clear_page_dirty_for_io
      before we do any crcs, and this forces any application with the file
      mapped to wait for the crc to finish before it is allowed to change
      the file.
      
      With compression on, the clear_page_dirty_for_io step is happening after
      we've compressed the pages.  This means the applications might be
      changing the pages while we are compressing them, and some of those
      modifications might not hit the disk.
      
      This commit adds the clear_page_dirty_for_io before compression starts
      and makes sure to redirty the page if we have to fall back to
      uncompressed IO as well.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Reported-by: Alexandre Oliva <oliva@gnu.org>
      cc: stable@vger.kernel.org
      4adaa611
  12. 24 Mar, 2013 1 commit
    • K
      block: Add bio_end_sector() · f73a1c7d
      Kent Overstreet committed
      Just a little convenience macro - main reason to add it now is preparing
      for immutable bio vecs, it'll reduce the size of the patch that puts
      bi_sector/bi_size/bi_idx into a struct bvec_iter.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      f73a1c7d
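A convenience macro of the kind commit f73a1c7d describes might look like the sketch below. It is modeled on the idea only: `struct toy_sector_bio` and the `toy_` macros are hypothetical, the 512-byte sector size is an assumption, and the exact kernel definition is not shown in the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy bio carrying only the two fields the macros need. */
struct toy_sector_bio {
    uint64_t bi_sector;   /* first sector of the I/O */
    unsigned int bi_size; /* remaining I/O size in bytes */
};

/* Number of 512-byte sectors covered by the bio. */
#define toy_bio_sectors(bio)    ((bio)->bi_size >> 9)

/* First sector past the end of the bio. */
#define toy_bio_end_sector(bio) ((bio)->bi_sector + toy_bio_sectors(bio))

/* Typical use: does bio b start exactly where bio a ends? */
bool toy_bio_contiguous(const struct toy_sector_bio *a,
                        const struct toy_sector_bio *b)
{
    assert(a != NULL && b != NULL);
    return toy_bio_end_sector(a) == b->bi_sector;
}
```

Hiding the field arithmetic behind one macro means callers do not need to change if the fields later move into another structure.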
  13. 02 Mar, 2013 1 commit
  14. 01 Mar, 2013 1 commit
  15. 27 Feb, 2013 1 commit
    • Q
      btrfs: cleanup for open-coded alignment · fda2832f
      Qu Wenruo committed
      Though most of the btrfs code uses the ALIGN macro for page alignment,
      some code still uses open-coded alignment like the
      following:
      ------
              u64 mask = ((u64)root->stripesize - 1);
              u64 ret = (val + mask) & ~mask;
      ------
      Or an even more hidden one:
      ------
              num_bytes = (end - start + blocksize) & ~(blocksize - 1);
      ------
      
      Sometimes this open-coded alignment is not so easy to understand for
      a newbie like me.
      
      This commit changes the open-coded alignment to the ALIGN macro for
      better readability.
      
      There is also a previous patch from David Sterba with similar changes,
      but that patch was for the 3.2 kernel and does not seem to have been merged.
      http://www.spinics.net/lists/linux-btrfs/msg12747.html
      
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fda2832f
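The equivalence that commit fda2832f relies on can be checked with a short sketch. `TOY_ALIGN` is a simplified stand-in for the kernel's ALIGN macro (the kernel version also handles type casting), and the three helper functions are hypothetical names wrapping the two open-coded forms quoted in the commit message and their macro replacement. The alignment is assumed to be a power of two, and `end` is assumed to be inclusive.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of a (a must be a power of two). */
#define TOY_ALIGN(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* First open-coded form from the commit message. */
uint64_t open_coded_align(uint64_t val, uint64_t stripesize)
{
    uint64_t mask = stripesize - 1;
    return (val + mask) & ~mask;
}

/* "Hidden" form: aligns the length of the inclusive range [start, end]. */
uint64_t hidden_align_len(uint64_t start, uint64_t end, uint64_t blocksize)
{
    return (end - start + blocksize) & ~(blocksize - 1);
}

/* Same computation as hidden_align_len(), spelled with the macro. */
uint64_t macro_align_len(uint64_t start, uint64_t end, uint64_t blocksize)
{
    /* Power-of-two check for the alignment. */
    assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
    return TOY_ALIGN(end - start + 1, blocksize);
}
```

The hidden form works because `(end - start + 1) + (blocksize - 1)` simplifies to `end - start + blocksize`, which is exactly what the macro expands to for a length of `end - start + 1`.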
  16. 21 Feb, 2013 2 commits
  17. 20 Feb, 2013 1 commit