1. 21 Nov, 2014 23 commits
    • Btrfs: make xattr replace operations atomic · 5f5bc6b1
      Filipe Manana committed
      Replacing an xattr consists of looking up its existing value, deleting
      the current value from the respective leaf, releasing the search path and
      finally inserting the new value. This leaves a time window in which readers
      (getxattr, listxattrs) won't see any value for the xattr. Xattrs are used to
      store ACLs, so this has security implications.
      
      This change also fixes two other existing issues:
      
      *) Deleting the old xattr value without first verifying that the new xattr
         will fit in the existing leaf item (in case multiple xattrs are packed in
         the same item due to name hash collision);
      
      *) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
         exist but we have an existing item that packs multiple xattrs with
         the same name hash as the input xattr. In this case we should return ENOSPC.
      
      A test case for xfstests follows soon.
      
      Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
      implementation.
      Reported-by: Alexandre Oliva <oliva@gnu.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
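The check-then-swap idea in the xattr commit above can be sketched as a small Python toy. This is not the Btrfs code: `XattrItem`, its `capacity` and the shared lock are hypothetical stand-ins for a leaf item that packs several xattrs. The error codes follow the commit message (EEXIST, ENOSPC) plus ENODATA for replacing a missing attribute, which is an assumption of this sketch.

```python
import errno
import threading

XATTR_CREATE = 1   # fail if the attribute already exists
XATTR_REPLACE = 2  # fail if the attribute does not exist


class XattrItem:
    """Toy stand-in for one leaf item packing several xattrs (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity        # space available for names + values
        self._entries = {}              # name -> value (bytes)
        self._lock = threading.Lock()   # readers and writers share this lock

    @staticmethod
    def _used(entries):
        return sum(len(name) + len(value) for name, value in entries.items())

    def get(self, name):
        with self._lock:
            return self._entries.get(name)

    def set(self, name, value, flags=0):
        """Return 0 on success or a negative errno; readers never see a gap."""
        with self._lock:
            exists = name in self._entries
            if flags & XATTR_CREATE and exists:
                return -errno.EEXIST
            if flags & XATTR_REPLACE and not exists:
                return -errno.ENODATA
            # Build the candidate state first. The old value stays in place
            # until we know the new one fits.
            candidate = dict(self._entries)
            candidate[name] = value
            if self._used(candidate) > self.capacity:
                return -errno.ENOSPC
            self._entries = candidate   # single swap from old to new
            return 0
```

A failed replace leaves the old value readable, which is the property the commit is after.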
    • Btrfs: avoid premature -ENOMEM in clear_extent_bit() · c7bc6319
      Filipe Manana committed
      We try to allocate an extent state structure before acquiring the extent
      state tree's spinlock, as we might need a new one later and want to avoid
      doing an atomic allocation while holding the tree's spinlock. However,
      we returned -ENOMEM if that initial non-atomic allocation failed, which is
      a bit excessive since we might end up not needing the pre-allocated extent
      state at all, for the case where the tree doesn't have any extent state
      that covers the input range and also covers any other range. Therefore don't
      return -ENOMEM if that pre-allocation fails.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
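A minimal Python sketch of the "preallocate opportunistically, fail only when the spare is really needed" pattern described above. It is a simplified toy and not the kernel's extent state tree: ranges are plain `(lo, hi)` tuples, `try_alloc` is a hypothetical callable that reports whether an allocation succeeded, and in this toy only a cut in the middle of a range needs a new record.

```python
import errno


def clear_range(states, start, end, try_alloc):
    """Remove [start, end] from a list of inclusive (lo, hi) ranges.

    Returns (0, new_states) on success or (-ENOMEM, states) when a split
    was required and no record could be allocated.
    """
    spare = try_alloc()              # opportunistic preallocation, may fail
    result = []
    for lo, hi in states:
        if hi < start or lo > end:
            result.append((lo, hi))  # range not touched by the clear
            continue
        if lo < start and hi > end:
            # One record becomes two, so an extra record is needed here.
            if not spare and not try_alloc():
                return -errno.ENOMEM, states
            spare = False            # the spare (or the late allocation) is used
            result.append((lo, start - 1))
            result.append((end + 1, hi))
        elif lo < start:
            result.append((lo, start - 1))   # trim the tail
        elif hi > end:
            result.append((end + 1, hi))     # trim the head
        # otherwise the range is fully covered and simply dropped
    return 0, result
```

With an allocator that always fails, clearing a range that touches nothing or that covers whole records still succeeds; only the middle cut reports -ENOMEM.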
    • Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2 · 7e33fd99
      Josef Bacik committed
      Our gluster boxes get several thousand statfs() calls per second, which begins
      to suck hardcore with all of the lock contention on the chunk mutex and dev list
      mutex.  We don't really need to hold these locks: if we get transient
      weirdness from statfs() because of the chunk allocator we don't care, so remove
      this locking.
      
      We still need the dev_list lock if you mount with -o alloc_start however, which
      is a good argument for nuking that thing from orbit, but that's a patch for
      another day.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: move read only block groups onto their own list V2 · 633c0aad
      Josef Bacik committed
      Our gluster boxes were spending lots of time in statfs because our fs'es are
      huge.  The problem is statfs loops through all of the block groups looking for
      read only block groups, and when you have several terabytes worth of data that
      ends up being a lot of block groups.  Move the read only block groups onto a
      read only list and only process that list in
      btrfs_account_ro_block_groups_free_space to reduce the amount of churn.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
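The idea of keeping read-only block groups on their own list, so that statfs only walks that list, can be illustrated with a toy in Python. The class and attribute names (`BlockGroup`, `SpaceInfo`, `ro_bgs`) are chosen for illustration and are not taken from the patch.

```python
class BlockGroup:
    """Toy block group with a size, a used byte count and a read-only flag."""

    def __init__(self, size, used, read_only=False):
        self.size = size
        self.used = used
        self.read_only = read_only


class SpaceInfo:
    """Tracks all block groups plus a separate list of the read-only ones."""

    def __init__(self):
        self.block_groups = []
        self.ro_bgs = []            # only the read-only block groups

    def add(self, bg):
        self.block_groups.append(bg)
        if bg.read_only:
            self.ro_bgs.append(bg)

    def set_read_only(self, bg):
        if not bg.read_only:
            bg.read_only = True
            self.ro_bgs.append(bg)

    def set_read_write(self, bg):
        if bg.read_only:
            bg.read_only = False
            self.ro_bgs.remove(bg)

    def ro_free_space(self):
        # Cost is proportional to the number of read-only groups,
        # not to the total number of block groups.
        return sum(bg.size - bg.used for bg in self.ro_bgs)
```

The trade-off is a little extra bookkeeping on every read-only/read-write transition in exchange for a much shorter walk on each statfs call.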
    • btrfs: fix typos in btrfs_check_super_valid · cd743fac
      David Sterba committed
      Fix copy-and-paste errors in some messages and add a few more missing
      macro accessors.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check-int: don't complain about balanced blocks · cf90c59e
      Stefan Behrens committed
      The xfstest btrfs/014, which tests the balance operation, caused the
      check_int module to complain that known blocks changed their physical
      location. Since this is not an error in this case, only print such a
      message if verbose mode is enabled.
      Reported-by: Wang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Tested-by: Wang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check_int: use the known block location · f382e465
      Stefan Behrens committed
      The xfstest btrfs/014, which tests the balance operation, caused issues with
      the check_int module. An attempt was made to use btrfs_map_block() to
      find the physical location of a written block. However, this was not
      needed at all, since the location of the written block was already known:
      a hook to submit_bio() was the reason for entering the check_int module.
      Additionally, after a block relocation btrfs_map_block() sometimes
      failed, causing misleading error messages afterwards.
      
      This patch changes the check_int module to use the known information of
      the physical location from the bio.
      Reported-by: Wang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Tested-by: Wang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early · c8fd3de7
      Filipe Manana committed
      We try to allocate an extent state before acquiring the tree's spinlock
      just in case we end up needing to split an existing extent state into two.
      If that allocation failed, we would return -ENOMEM.
      However, our single caller (the transaction/log commit code) passes in
      an extent state that was cached from a call to find_first_extent_bit() and
      that has a very high chance of matching the input range exactly (always true
      for a transaction commit and very often, but not always, true for a log
      commit). In this case we end up not needing that initial extent state,
      used for an eventual split, at all. Therefore just don't return -ENOMEM if
      we can't allocate the temporary extent state, since we might not need it
      at all, and if we end up needing one, we'll allocate it later anyway.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: make find_first_extent_bit be able to cache any state · e38e2ed7
      Filipe Manana committed
      Right now the only caller of find_first_extent_bit() that is interested
      in caching extent states (transaction or log commit) never gets an extent
      state cached. This is because find_first_extent_bit() only caches states
      that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
      the transaction/log commit caller always passes a tree whose extent states
      never have any of those flags (they can only have one of the
      following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).
      
      This change, together with the following one in the patch series (titled
      "Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early"), will
      help significantly reduce the chance that calls to convert_extent_bit()
      fail with -ENOMEM when called from the transaction/log commit code.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: deal with convert_extent_bit errors to avoid fs corruption · 663dfbb0
      Filipe Manana committed
      When committing a transaction or a log, we look for btree extents that
      need to be durably persisted by searching for ranges in an io tree that
      have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
      those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
      convert_extent_bit, and then start writeback for the extents.
      
      That function however can return an error (at the moment only -ENOMEM
      is possible, especially when it does GFP_ATOMIC allocation requests
      through alloc_extent_state_atomic) - that means the ranges didn't get
      the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
      which in turn means a call to btrfs_wait_marked_extents() won't find
      those ranges for which we started writeback, causing a transaction
      commit or a log commit to persist a new superblock without waiting
      for the writeback of extents in that range to finish first.
      
      Therefore if a crash happens after persisting the new superblock and
      before writeback finishes, we have a superblock pointing to roots that
      weren't fully persisted or roots that point to nodes or leaves that weren't
      fully persisted, causing all sorts of unexpected/bad behaviour as we end up
      reading garbage from disk or the content of some node/leaf from a past
      generation that got cowed or deleted and is no longer valid (for this latter
      case we end up getting error messages like "parent transid verify failed on
      X wanted Y found Z" when reading btree nodes/leaves from disk).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: return failure if btrfs_dev_replace_finishing() failed · 2fc9f6ba
      Eryu Guan committed
      Device replace could fail due to another running scrub process or any
      other errors btrfs_scrub_dev() may hit, but this failure doesn't get
      returned to userspace.
      
      The following steps reproduce this issue:
      
      	mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
      	mount /dev/sdb1 /mnt/btrfs
      	while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
      	btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
      	# if this replace succeeded, do the following and repeat until
      	# you see this log in dmesg
      	# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
      	#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
      
      	# once you see the error log in dmesg, check return value of
      	# replace
      	echo $?
      
      Introduce a new dev replace result
      
      BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS
      
      to catch -EINPROGRESS explicitly and return other errors directly to
      userspace.
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix allocationg memory failure for btrfsic_state structure · 6b3a4d60
      Shilong Wang committed
      The @btrfsic_state structure needs more than 2M, so allocating it with
      kzalloc() is very likely to fail. See the following message:
      
      [91428.902148] Call Trace:
      [<ffffffff816f6e0f>] dump_stack+0x4d/0x66
      [<ffffffff811b1c7f>] warn_alloc_failed+0xff/0x170
      [<ffffffff811b66e1>] __alloc_pages_nodemask+0x951/0xc30
      [<ffffffff811fd9da>] alloc_pages_current+0x11a/0x1f0
      [<ffffffff811b1e0b>] ? alloc_kmem_pages+0x3b/0xf0
      [<ffffffff811b1e0b>] alloc_kmem_pages+0x3b/0xf0
      [<ffffffff811d1018>] kmalloc_order+0x18/0x50
      [<ffffffff811d1074>] kmalloc_order_trace+0x24/0x140
      [<ffffffffa06c097b>] btrfsic_mount+0x8b/0xae0 [btrfs]
      [<ffffffff810af555>] ? check_preempt_curr+0x85/0xa0
      [<ffffffff810b2de3>] ? try_to_wake_up+0x103/0x430
      [<ffffffffa063d200>] open_ctree+0x1bd0/0x2130 [btrfs]
      [<ffffffffa060fdde>] btrfs_mount+0x62e/0x8b0 [btrfs]
      [<ffffffff811fd9da>] ? alloc_pages_current+0x11a/0x1f0
      [<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
      [<ffffffff81230429>] mount_fs+0x39/0x1b0
      [<ffffffff812509fb>] vfs_kern_mount+0x6b/0x150
      [<ffffffff812537fb>] do_mount+0x27b/0xc30
      [<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
      [<ffffffff812544f6>] SyS_mount+0x96/0xf0
      [<ffffffff81701970>] system_call_fastpath+0x16/0x1b
      
      Since we are allocating memory for a hash table array, it would be
      good to get contiguous pages here if we can.
      
      Fix this problem by trying kzalloc() first and, if that fails,
      using vzalloc() instead.
      Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: report error after failure inlining extent in compressed write path · e6eb4314
      Filipe Manana committed
      If cow_file_range_inline() failed, when called from compress_file_range(),
      we were tagging the locked page for writeback, ending its writeback and
      unlocking it, but not marking it with an error nor setting AS_EIO in the
      inode's mapping flags.
      
      This made it impossible for a caller of filemap_fdatawrite_range (writepages)
      or filemap_fdatawait_range() to know that an error happened. And the return
      value of compress_file_range() is useless because it's returned to a workqueue
      task and not to the task calling filemap_fdatawrite_range (writepages).
      
      This change applies on top of the previous patchset starting at the patch
      titled:
      
          "[1/5] Btrfs: set page and mapping error on compressed write failure"
      
      Which changed extent_clear_unlock_delalloc() to use SetPageError and
      mapping_set_error().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add helper btrfs_fdatawrite_range · 728404da
      Filipe Manana committed
      Add a helper to avoid duplicating the double filemap_fdatawrite_range()
      call for inodes with async extents (compressed writes) so often.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: correctly flush compressed data before/after direct IO · 075bdbdb
      Filipe Manana committed
      For compressed writes, after doing the first filemap_fdatawrite_range() we
      don't get the pages tagged for writeback immediately. Instead we create
      a workqueue task, which is run by another kthread, and keep the pages locked.
      That other kthread compresses data, creates the respective ordered extent/s,
      tags the pages for writeback and unlocks them. Therefore we need a second
      call to filemap_fdatawrite_range() if we have compressed writes, as this
      second call will wait for the pages to become unlocked, then see they became
      tagged for writeback and finally wait for the writeback to finish.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: make inode.c:compress_file_range() return void · c44f649e
      Filipe Manana committed
      Its return value is useless: its single caller ignores it and can't do
      anything with it anyway, since it's a workqueue task and not the task
      calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
      Failure is communicated to such functions via start and end of writeback
      with the respective pages tagged with an error and the AS_EIO flag set in
      the inode's mapping.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix incorrect compression ratio detection · 4bcbb332
      Shilong Wang committed
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sdb
       # mount -t btrfs /dev/sdb /mnt -o compress=lzo
       # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
      
      After the previous steps, the inode is detected as having a bad compression
      ratio, and the NOCOMPRESS flag is set for that inode.
      
      The reason is that compression handles a limited number of pages per pass
      (128K), so a 132K write is split into two writes (128K + 4K). This bug
      is a leftover from commit 68bb462d (Btrfs: don't compress for a small write).
      
      Fix this problem by checking before every compression pass: if it is a
      small write (<= blocksize), we bail out and fall back to no compression directly.
      Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
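The arithmetic behind the 132K example can be worked through with a short Python sketch. It only models the splitting and the small-write check described in the commit message. The 4096-byte block size is an assumption taken from the `dd` command in the repro steps, and `split_for_compression` is a hypothetical helper name.

```python
BLOCK_SIZE = 4096                  # assumed block size (bs unit in the repro)
MAX_COMPRESS_CHUNK = 128 * 1024    # per-pass limit named in the commit message


def split_for_compression(length):
    """Split a write into passes and decide, per pass, whether to compress.

    Returns a list of (offset, size, try_compress) tuples. A pass that is
    not larger than one block skips compression, so it cannot be mistaken
    for data that compresses badly.
    """
    passes = []
    offset = 0
    while offset < length:
        size = min(MAX_COMPRESS_CHUNK, length - offset)
        passes.append((offset, size, size > BLOCK_SIZE))
        offset += size
    return passes
```

For the repro write of 33 * 4096 = 135168 bytes this yields one 131072-byte pass that is compressed and one 4096-byte tail that is written uncompressed.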
    • Btrfs: don't ignore compressed bio write errors · 7bdcefc1
      Filipe Manana committed
      Our compressed bio write end callback was essentially ignoring the error
      parameter. When a write error happens, it must pass a value of 0 to the
      inode's write_page_end_io_hook callback, SetPageError on the respective
      pages and set AS_EIO in the inode's mapping flags, so that a call to
      filemap_fdatawait_range() / filemap_fdatawait() can find out that errors
      happened (we surely don't want silent failures on fsync for example).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: make inode.c:submit_compressed_extents() return void · dec8f175
      Filipe Manana committed
      Its return value is completely ignored by its single caller and it's
      useless anyway, since errors are indicated through SetPageError and
      the bit AS_EIO set in the flags of the inode's mapping. The caller
      can't do anything with the value, as it's invoked from a workqueue
      task and not by the task calling filemap_fdatawrite_range (which calls
      the writepages address space callback, which in turn calls the inode's
      fill_delalloc callback).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: process all async extents on compressed write failure · 3d7a820f
      Filipe Manana committed
      If we had an error when processing one of the async extents from our list,
      we were not processing the remaining async extents, meaning we would leak
      those async_extent structs, never release the pages with the compressed
      data and never unlock and clear the dirty flag from the inode's pages (those
      that correspond to the uncompressed content).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
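The "keep going after a failure" behaviour described above can be sketched in Python as follows. This is a generic illustration, not the Btrfs function: `submit_all`, `submit` and `cleanup` are hypothetical names, and `submit` is assumed to return 0 or a negative errno in kernel style.

```python
def submit_all(extents, submit, cleanup):
    """Process every extent, even after one of them fails.

    For each failed extent, `cleanup` is called so its resources are
    released, and the loop continues with the remaining extents.
    Returns a list of (extent, error) pairs for the failures.
    """
    failed = []
    for extent in extents:
        ret = submit(extent)
        if ret < 0:
            cleanup(extent)          # release this extent's resources
            failed.append((extent, ret))
            continue                 # do not abandon the rest of the list
    return failed
```

Returning early from the loop on the first error would leave the later extents unprocessed, which is the leak the commit describes.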
    • Btrfs: don't leak pages and memory on compressed write error · 40ae837b
      Filipe Manana committed
      In inode.c:submit_compressed_extents(), if we fail before calling
      btrfs_submit_compressed_write(), or when that function fails, we
      were freeing the async_extent structure without releasing its pages
      and freeing the pages array.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix hang on compressed write error · fce2a4e6
      Filipe Manana committed
      In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
      we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
      btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
      we never end the writeback on the pages, so any filemap_fdatawait_range() call will
      hang forever. We were also not calling the writepage end io hook, which means the
      corresponding ordered extent will never complete and all its waiters will block
      forever, such as a full fsync (via btrfs_wait_ordered_range()).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: set page and mapping error on compressed write failure · 704de49d
      Filipe Manana committed
      If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
      we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
      but we don't tag the pages, nor the inode's mapping, with an error. This makes it
      impossible for a caller of filemap_fdatawait_range() (fsync or transaction
      commit, for example) to know that there was an error.
      
      Note that the return value of submit_compressed_extents() is useless, as that function
      is executed by a workqueue task and not directly by the fill_delalloc callback. This
      means the writepage/s callbacks of the inode's address space operations don't get that
      return value.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 14 Nov, 2014 2 commits
  3. 13 Nov, 2014 11 commits
  4. 07 Nov, 2014 4 commits
    • xfs: track bulkstat progress by agino · 00275899
      Dave Chinner committed
      The bulkstat main loop progress is tracked by the "lastino"
      variable, which is a full 64-bit inode number. However, the loop actually
      works on agno/agino pairs, and so there's a significant disconnect
      between the rest of the loop and the main cursor. Convert this to
      use the agino, and pass the agino into the chunk formatting function
      and convert it too.
      
      This gets rid of the inconsistency in the loop processing, and
      finally makes it simple for us to skip inodes at any point in the
      loop simply by incrementing the agino cursor.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
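As a rough illustration of why a cursor kept as an agno/agino pair is easier to advance than a full inode number, here is a small Python sketch. It assumes an inode number is the AG number in the high bits and the AG-relative inode number in the low bits. `AGINO_BITS` is a made-up constant for this sketch, since the real width depends on the filesystem geometry, and the function names are hypothetical.

```python
AGINO_BITS = 32   # hypothetical width, not a real XFS constant


def ino_to_ag(ino):
    """Split a full inode number into an (agno, agino) pair."""
    return ino >> AGINO_BITS, ino & ((1 << AGINO_BITS) - 1)


def ag_to_ino(agno, agino):
    """Combine an (agno, agino) pair back into a full inode number."""
    return (agno << AGINO_BITS) | agino


def skip_inodes(agno, agino, count):
    """Advance the cursor within the same AG by bumping agino."""
    return agno, agino + count
```

With the pair as the cursor, skipping inodes is a plain increment of `agino`, and the full inode number is only rebuilt when it has to be reported.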
    • xfs: bulkstat error handling is broken · febe3cbe
      Dave Chinner committed
      The error propagation is a horror - xfs_bulkstat() returns
      a rval variable which is only set if there are formatter errors. Any
      sort of btree walk error or corruption will cause the bulkstat walk
      to terminate but will not pass an error back to userspace. Worse
      is the fact that formatter errors will also be ignored if any inodes
      were correctly formatted into the user buffer.
      
      Hence bulkstat can fail badly yet still report success to userspace.
      This causes significant issues with xfsdump not dumping everything
      in the filesystem yet reporting success. It's not until a restore
      fails that there is any indication that the dump was bad and that
      bulkstat failed. This patch now triggers xfsdump to fail with
      bulkstat errors rather than silently missing files in the dump.
      
      This now causes bulkstat to fail when the lastino cookie does not
      fall inside an existing inode chunk. The pre-3.17 code tolerated
      that error by allowing the code to move to the next inode chunk
      as the agino target is guaranteed to fall into the next btree
      record.
      
      With the fixes up to this point in the series, xfsdump now passes on
      the troublesome filesystem image that exposes all these bugs.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: bulkstat main loop logic is a mess · 6e57c542
      Dave Chinner committed
      There are a bunch of variables that are more widely scoped than they
      need to be, obfuscated user buffer checks and tortured "next inode"
      tracking. This all needs cleaning up to expose the real issues that
      need fixing.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: bulkstat chunk-formatter has issues · 2b831ac6
      Dave Chinner committed
      The loop construct has issues:
      	- clustidx is completely unused, so remove it.
      	- the loop tries to be smart by terminating when the
      	  "freecount" tells it that all inodes are free. Just drop
      	  it as in most cases we have to scan all inodes in the
      	  chunk anyway.
      	- move the "user buffer left" condition check to the only
      	  point where we consume space in the user buffer.
      	- move the initialisation of agino out of the loop, leaving
      	  just a simple loop control logic using the clusteridx.
      
      Also, double handling of the user buffer variables leads to problems
      tracking the current state - use the cursor variables directly
      rather than keeping local copies and then having to update the
      cursor before returning.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>