1. 21 Nov 2014, 15 commits
    • Btrfs: make find_first_extent_bit be able to cache any state · e38e2ed7
      Committed by Filipe Manana
      Right now the only caller of find_first_extent_bit() that is interested
      in caching extent states (transaction or log commit) never gets an extent
      state cached. This is because find_first_extent_bit() only caches states
      that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
      the transaction/log commit caller always passes a tree that never has
      extent states with any of those flags (they can only have one of the
      following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).
      
      This change, together with the following one in the patch series (titled
      "Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early"), will
      significantly reduce the chances of calls to convert_extent_bit()
      failing with -ENOMEM when called from the transaction/log commit code.
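
      A rough sketch of the transaction/log commit caller pattern that benefits,
      loosely modelled on btrfs_write_marked_extents() (the exact function
      signatures in this kernel era may differ slightly; dirty_pages, mark and
      mapping stand in for the commit's io tree, the bit being processed and the
      btree inode's mapping):

          struct extent_state *cached_state = NULL;
          u64 start = 0, end;

          /* With this change the state found here gets cached even though the
           * tree only ever carries EXTENT_DIRTY/EXTENT_NEW/EXTENT_NEED_WAIT. */
          while (!find_first_extent_bit(dirty_pages, start, &start, &end,
                                        mark, &cached_state)) {
                  /* The cached state lets convert_extent_bit() avoid a second
                   * tree search (and an extent_state allocation) for the range. */
                  convert_extent_bit(dirty_pages, start, end, EXTENT_NEED_WAIT,
                                     mark, &cached_state, GFP_NOFS);
                  filemap_fdatawrite_range(mapping, start, end);
                  start = end + 1;
          }
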
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e38e2ed7
    • Btrfs: deal with convert_extent_bit errors to avoid fs corruption · 663dfbb0
      Committed by Filipe Manana
      When committing a transaction or a log, we look for btree extents that
      need to be durably persisted by searching for ranges in an io tree that
      have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
      those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
      convert_extent_bit, and then start writeback for the extents.
      
      That function however can return an error (at the moment only -ENOMEM
      is possible, especially when it does GFP_ATOMIC allocation requests
      through alloc_extent_state_atomic) - that means the ranges didn't get
      the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
      which in turn means a call to btrfs_wait_marked_extents() won't find
      those ranges for which we started writeback, causing a transaction
      commit or a log commit to persist a new superblock without waiting
      for the writeback of extents in that range to finish first.
      
      Therefore if a crash happens after persisting the new superblock and
      before writeback finishes, we have a superblock pointing to roots that
      weren't fully persisted, or roots that point to nodes or leaves that weren't
      fully persisted, causing all sorts of unexpected/bad behaviour as we end up
      reading garbage from disk or the content of some node/leaf from a past
      generation that got cowed or deleted and is no longer valid (for this latter
      case we end up getting error messages like "parent transid verify failed on
      X wanted Y found Z" when reading btree nodes/leaves from disk).
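
      A hypothetical sketch of the kind of handling this commit is about (not the
      actual diff): check and propagate the error from convert_extent_bit()
      instead of starting writeback and later persisting a superblock over ranges
      whose bits may not have been converted (werr stands for the caller's error
      accumulator):

          err = convert_extent_bit(dirty_pages, start, end, EXTENT_NEED_WAIT,
                                   mark, &cached_state, GFP_NOFS);
          if (err) {
                  /* The range may be missing EXTENT_NEED_WAIT; report the
                   * failure so the transaction/log commit can be aborted
                   * instead of committing without waiting for this writeback. */
                  werr = err;
                  break;
          }
          filemap_fdatawrite_range(mapping, start, end);
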
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      663dfbb0
    • Btrfs: return failure if btrfs_dev_replace_finishing() failed · 2fc9f6ba
      Committed by Eryu Guan
      Device replace could fail due to another running scrub process or any
      other error btrfs_scrub_dev() may hit, but this failure doesn't get
      returned to userspace.
      
      The following steps reproduce this issue:
      
      	mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
      	mount /dev/sdb1 /mnt/btrfs
      	while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
      	btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
      	# if this replace succeeded, do the following and repeat until
      	# you see this log in dmesg
      	# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
      	#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
      
      	# once you see the error log in dmesg, check return value of
      	# replace
      	echo $?
      
      Introduce a new dev replace result
      
      BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS
      
      to catch -EINPROGRESS explicitly and return other errors directly to
      userspace.
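
      A hypothetical sketch of that mapping (simplified, not the actual diff;
      scrub_ret stands for whatever btrfs_scrub_dev() returned and args is the
      struct btrfs_ioctl_dev_replace_args handed back to userspace):

          if (scrub_ret == -EINPROGRESS) {
                  /* another scrub is already running: report it explicitly */
                  args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
                  ret = 0;
          } else {
                  ret = scrub_ret;    /* propagate real errors to userspace */
          }
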
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      2fc9f6ba
    • Btrfs: fix allocating memory failure for btrfsic_state structure · 6b3a4d60
      Committed by Shilong Wang
      The size of @btrfsic_state is more than 2M, so allocating it with
      kzalloc() is very likely to fail. See the following message:
      
      [91428.902148] Call Trace:
      [<ffffffff816f6e0f>] dump_stack+0x4d/0x66
      [<ffffffff811b1c7f>] warn_alloc_failed+0xff/0x170
      [<ffffffff811b66e1>] __alloc_pages_nodemask+0x951/0xc30
      [<ffffffff811fd9da>] alloc_pages_current+0x11a/0x1f0
      [<ffffffff811b1e0b>] ? alloc_kmem_pages+0x3b/0xf0
      [<ffffffff811b1e0b>] alloc_kmem_pages+0x3b/0xf0
      [<ffffffff811d1018>] kmalloc_order+0x18/0x50
      [<ffffffff811d1074>] kmalloc_order_trace+0x24/0x140
      [<ffffffffa06c097b>] btrfsic_mount+0x8b/0xae0 [btrfs]
      [<ffffffff810af555>] ? check_preempt_curr+0x85/0xa0
      [<ffffffff810b2de3>] ? try_to_wake_up+0x103/0x430
      [<ffffffffa063d200>] open_ctree+0x1bd0/0x2130 [btrfs]
      [<ffffffffa060fdde>] btrfs_mount+0x62e/0x8b0 [btrfs]
      [<ffffffff811fd9da>] ? alloc_pages_current+0x11a/0x1f0
      [<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
      [<ffffffff81230429>] mount_fs+0x39/0x1b0
      [<ffffffff812509fb>] vfs_kern_mount+0x6b/0x150
      [<ffffffff812537fb>] do_mount+0x27b/0xc30
      [<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
      [<ffffffff812544f6>] SyS_mount+0x96/0xf0
      [<ffffffff81701970>] system_call_fastpath+0x16/0x1b
      
      Since we are allocating memory for a hash table array, it would be
      good if we could get contiguous pages here.
      
      Fix this problem by first trying kzalloc(); if that fails,
      fall back to vzalloc() instead.
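
      A minimal sketch of that fallback pattern (the function name is made up
      for illustration; kzalloc(), vzalloc() and kvfree() are the real kernel
      APIs):

          #include <linux/slab.h>
          #include <linux/vmalloc.h>
          #include <linux/mm.h>

          /* Try a physically contiguous allocation first and fall back to
           * vmalloc()ed memory for the large btrfsic_state structure.
           * __GFP_NOWARN avoids the allocation-failure splat shown above. */
          static void *btrfsic_alloc_state(size_t size)
          {
                  void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);

                  if (!p)
                          p = vzalloc(size);
                  return p;
          }

      The matching free side can simply use kvfree(), which handles both kinds
      of allocation.
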
      Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      6b3a4d60
    • Btrfs: report error after failure inlining extent in compressed write path · e6eb4314
      Committed by Filipe Manana
      If cow_file_range_inline() failed when called from compress_file_range(),
      we were tagging the locked page for writeback, ending its writeback and unlocking it,
      but not marking it with an error nor setting AS_EIO in the inode's mapping flags.
      
      This made it impossible for a caller of filemap_fdatawrite_range (writepages)
      or filemap_fdatawait_range() to know that an error happened. And the return
      value of compress_file_range() is useless because it's returned to a workqueue
      task and not to the task calling filemap_fdatawrite_range (writepages).
      
      This change applies on top of the previous patchset starting at the patch
      titled:
      
          "[1/5] Btrfs: set page and mapping error on compressed write failure"
      
      Which changed extent_clear_unlock_delalloc() to use SetPageError and
      mapping_set_error().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e6eb4314
    • Btrfs: add helper btrfs_fdatawrite_range · 728404da
      Committed by Filipe Manana
      Add a helper to avoid duplicating the double filemap_fdatawrite_range() call for
      inodes with async extents (compressed writes) so often.
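
      A sketch of what such a helper can look like (an approximation, not
      necessarily the exact code that was merged): write the range once, and
      write it a second time if the inode may have async (compressed) extents
      whose pages only become writeback-tagged after the first pass:

          int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end)
          {
                  int ret;

                  ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
                  if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
                                       &BTRFS_I(inode)->runtime_flags))
                          ret = filemap_fdatawrite_range(inode->i_mapping,
                                                         start, end);

                  return ret;
          }
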
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      728404da
    • Btrfs: correctly flush compressed data before/after direct IO · 075bdbdb
      Committed by Filipe Manana
      For compressed writes, after doing the first filemap_fdatawrite_range() we
      don't get the pages tagged for writeback immediately. Instead we create
      a workqueue task, which is run by another kthread, and keep the pages locked.
      That other kthread compresses data, creates the respective ordered extent/s,
      tags the pages for writeback and unlocks them. Therefore we need a second
      call to filemap_fdatawrite_range() if we have compressed writes, as this
      second call will wait for the pages to become unlocked, then see they became
      tagged for writeback and finally wait for the writeback to finish.
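
      A minimal sketch of the resulting flush-and-wait sequence around direct IO,
      assuming the btrfs_fdatawrite_range() helper from the previous commit
      (simplified; not the actual diff):

          ret = btrfs_fdatawrite_range(inode, lockstart, lockend);
          if (!ret)
                  ret = filemap_fdatawait_range(inode->i_mapping,
                                                lockstart, lockend);
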
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      075bdbdb
    • Btrfs: make inode.c:compress_file_range() return void · c44f649e
      Committed by Filipe Manana
      Its return value is useless, its single caller ignores it and can't do
      anything with it anyway, since it's a workqueue task and not the task
      calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
      Failure is communicated to such functions via starting and ending writeback
      with the respective pages tagged with an error and the AS_EIO flag set in the
      inode's mapping.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c44f649e
    • Btrfs: fix incorrect compression ratio detection · 4bcbb332
      Committed by Shilong Wang
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sdb
       # mount -t btrfs /dev/sdb /mnt -o compress=lzo
       # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
      
      After the previous steps, the inode will be detected as having a bad
      compression ratio, and the NOCOMPRESS flag will be set for that inode.
      
      The reason is that compression has a maximum page limit per pass (128K), so a
      132K write gets split into two writes (128K + 4K). This bug is a leftover
      from commit 68bb462d (Btrfs: don't compress for a small write).
      
      Fix this problem by checking before each compression pass: if it is a
      small write (<= blocksize), bail out and fall back to no compression directly.
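
      A hypothetical sketch of such a check, placed before any compression work
      is done in compress_file_range() (blocksize is the filesystem block size;
      the bail-out label is assumed to be the existing uncompressed fallback
      path):

          /* Small writes can never pass the compression-ratio check, so skip
           * compression entirely instead of poisoning the inode's flags. */
          if ((end - start + 1) <= blocksize)
                  goto cleanup_and_bail_uncompressed;
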
      Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      4bcbb332
    • Btrfs: don't ignore compressed bio write errors · 7bdcefc1
      Committed by Filipe Manana
      Our compressed bio write end callback was essentially ignoring the error
      parameter. When a write error happens, it must pass a value of 0 to the
      inode's write_page_end_io_hook callback, call SetPageError on the respective
      pages and set AS_EIO in the inode's mapping flags, so that a call to
      filemap_fdatawait_range() / filemap_fdatawait() can find out that errors
      happened (we surely don't want silent failures on fsync, for example).
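
      A simplified sketch of that marking (the helper name is hypothetical; the
      real handling belongs in the end_compressed_bio_write() callback, which
      would also pass 0 as the uptodate value to the write_page_end_io_hook):

          static void compressed_write_mark_error(struct inode *inode,
                                                  struct page *page)
          {
                  SetPageError(page);
                  mapping_set_error(inode->i_mapping, -EIO);  /* sets AS_EIO */
          }
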
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      7bdcefc1
    • Btrfs: make inode.c:submit_compressed_extents() return void · dec8f175
      Committed by Filipe Manana
      Its return value is completely ignored by its single caller and it's
      useless anyway, since errors are indicated through SetPageError and
      the bit AS_EIO set in the flags of the inode's mapping. The caller
      can't do anything with the value, as it's invoked from a workqueue
      task and not by the task calling filemap_fdatawrite_range (which calls
      the writepages address space callback, which in turn calls the inode's
      fill_delalloc callback).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      dec8f175
    • Btrfs: process all async extents on compressed write failure · 3d7a820f
      Committed by Filipe Manana
      If we had an error when processing one of the async extents from our list,
      we were not processing the remaining async extents, meaning we would leak
      those async_extent structs, never release the pages with the compressed
      data and never unlock and clear the dirty flag from the inode's pages (those
      that correspond to the uncompressed content).
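
      A simplified sketch of the intended control flow (submit_one_async_extent()
      and free_async_extent_pages() are illustrative names, not necessarily the
      real helpers): an error on one extent cleans that extent up and moves on to
      the next instead of abandoning the rest of the list:

          while (!list_empty(&async_cow->extents)) {
                  async_extent = list_entry(async_cow->extents.next,
                                            struct async_extent, list);
                  list_del(&async_extent->list);

                  /* on failure, still release this extent's compressed pages */
                  if (submit_one_async_extent(inode, async_extent) < 0)
                          free_async_extent_pages(async_extent);

                  kfree(async_extent);
          }
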
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3d7a820f
    • Btrfs: don't leak pages and memory on compressed write error · 40ae837b
      Committed by Filipe Manana
      In inode.c:submit_compressed_extents(), if we fail before calling
      btrfs_submit_compressed_write(), or when that function fails, we
      were freeing the async_extent structure without releasing its pages
      and freeing the pages array.
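
      One possible shape for the missing cleanup (a sketch; the helper and field
      names are assumptions based on the description above):

          static void free_async_extent_pages(struct async_extent *async_extent)
          {
                  int i;

                  /* drop the references on the pages holding compressed data */
                  for (i = 0; i < async_extent->nr_pages; i++)
                          page_cache_release(async_extent->pages[i]);

                  kfree(async_extent->pages);
                  async_extent->pages = NULL;
                  async_extent->nr_pages = 0;
          }
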
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      40ae837b
    • Btrfs: fix hang on compressed write error · fce2a4e6
      Committed by Filipe Manana
      In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
      we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
      btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
      we never end the writeback on the pages, so any filemap_fdatawait_range() call will
      hang forever. We were also not calling the writepage end io hook, which means the
      corresponding ordered extent will never complete and all its waiters will block
      forever, such as a full fsync (via btrfs_wait_ordered_range()).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      fce2a4e6
    • Btrfs: set page and mapping error on compressed write failure · 704de49d
      Committed by Filipe Manana
      If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
      we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
      but we don't tag the pages, nor the inode's mapping, with an error. This makes it
      impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit,
      for example) to know that there was an error.
      
      Note that the return value of submit_compressed_extents() is useless, as that function
      is executed by a workqueue task and not directly by the fill_delalloc callback. This
      means the writepage/s callbacks of the inode's address space operations don't get that
      return value.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      704de49d
  2. 14 Nov 2014, 2 commits
  3. 13 Nov 2014, 11 commits
  4. 07 Nov 2014, 6 commits
    • xfs: track bulkstat progress by agino · 00275899
      Committed by Dave Chinner
      The bulkstat main loop progress is tracked by the "lastino"
      variable, which is a full 64-bit inode number. However, the loop actually
      works on agno/agino pairs, and so there's a significant disconnect
      between the rest of the loop and the main cursor. Convert this to
      use the agino, and pass the agino into the chunk formatting function
      and convert it too.
      
      This gets rid of the inconsistency in the loop processing, and
      finally makes it simple for us to skip inodes at any point in the
      loop simply by incrementing the agino cursor.
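
      A sketch of what tracking by agno/agino looks like (XFS_INO_TO_AGNO,
      XFS_INO_TO_AGINO and XFS_AGINO_TO_INO are the real conversion macros; the
      surrounding logic is simplified):

          xfs_agnumber_t  agno  = XFS_INO_TO_AGNO(mp, lastino);
          xfs_agino_t     agino = XFS_INO_TO_AGINO(mp, lastino);

          /* ... walk the AG; skipping an inode is now just ... */
          agino++;

          /* ... and the 64-bit cookie is only rebuilt when reporting
           * progress back to the caller: */
          *lastinop = XFS_AGINO_TO_INO(mp, agno, agino);
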
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      00275899
    • xfs: bulkstat error handling is broken · febe3cbe
      Committed by Dave Chinner
      The error propagation is a horror - xfs_bulkstat() returns
      a rval variable which is only set if there are formatter errors. Any
      sort of btree walk error or corruption will cause the bulkstat walk
      to terminate but will not pass an error back to userspace. Worse
      is the fact that formatter errors will also be ignored if any inodes
      were correctly formatted into the user buffer.
      
      Hence bulkstat can fail badly yet still report success to userspace.
      This causes significant issues with xfsdump not dumping everything
      in the filesystem yet reporting success. It's not until a restore
      fails that there is any indication that the dump was bad and that
      bulkstat failed. This patch makes bulkstat errors cause xfsdump to
      fail, rather than silently missing files in the dump.
      
      This now causes bulkstat to fail when the lastino cookie does not
      fall inside an existing inode chunk. The pre-3.17 code tolerated
      that error by allowing the code to move to the next inode chunk
      as the agino target is guaranteed to fall into the next btree
      record.
      
      With the fixes up to this point in the series, xfsdump now passes on
      the troublesome filesystem image that exposes all these bugs.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      febe3cbe
    • xfs: bulkstat main loop logic is a mess · 6e57c542
      Committed by Dave Chinner
      There are a bunch of variables that are more widely scoped than they
      need to be, obfuscated user buffer checks and tortured "next inode"
      tracking. This all needs cleaning up to expose the real issues that
      need fixing.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      6e57c542
    • xfs: bulkstat chunk-formatter has issues · 2b831ac6
      Committed by Dave Chinner
      The loop construct has issues:
      	- clustidx is completely unused, so remove it.
      	- the loop tries to be smart by terminating when the
      	  "freecount" tells it that all inodes are free. Just drop
      	  it as in most cases we have to scan all inodes in the
      	  chunk anyway.
      	- move the "user buffer left" condition check to the only
      	  point where we consume space in the user buffer.
      	- move the initialisation of agino out of the loop, leaving
      	  just a simple loop control logic using the clusteridx.
      
      Also, double handling of the user buffer variables leads to problems
      tracking the current state - use the cursor variables directly
      rather than keeping local copies and then having to update the
      cursor before returning.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2b831ac6
    • xfs: bulkstat chunk formatting cursor is broken · bf4a5af2
      Committed by Dave Chinner
      The xfs_bulkstat_agichunk formatting cursor takes buffer values from
      the main loop and passes them via the structure to the chunk
      formatter, and then writes the changed values back into the main loop
      local variables. Unfortunately, this complex dance is full of corner
      cases that aren't handled correctly.
      
      The biggest problem is that it is double handling the information in
      both the main loop and the chunk formatting function, leading to
      inconsistent updates and endless loops where progress is not made.
      
      To fix this, push the struct xfs_bulkstat_agichunk outwards to be
      the primary holder of user buffer information. This removes the
      double handling in the main loop.
      
      Also, pass the last inode processed by the chunk formatter as a
      separate parameter, as it is purely an output variable and is not
      related to the user buffer consumption cursor.
      
      Finally, the chunk formatting code is not shared by anyone, so make
      it local to xfs_itable.c.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      bf4a5af2
    • xfs: bulkstat btree walk doesn't terminate · afa947cb
      Committed by Dave Chinner
      The bulkstat code has several different ways of detecting the end of
      an AG when doing a walk. They are not consistently detected, and the
      code that checks for the end of AG conditions is not consistently
      coded. Hence there are conditions where the walk code can get stuck in
      an endless loop making no progress and not triggering any
      termination conditions.
      
      Convert all the "tmp/i" status return codes from btree operations
      to a common name (stat) and apply end-of-ag detection to these
      operations consistently.
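
      A sketch of the consistent end-of-AG check this describes
      (xfs_btree_increment() is the real btree API; the surrounding loop is
      simplified):

          error = xfs_btree_increment(cur, 0, &stat);
          if (error)
                  break;          /* propagate btree errors */
          if (stat == 0) {
                  end_of_ag = 1;  /* no more records in this AG */
                  break;
          }
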
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      afa947cb
  5. 06 Nov 2014, 1 commit
  6. 05 Nov 2014, 4 commits
  7. 04 Nov 2014, 1 commit