1. 27 Jul, 2016 1 commit
    • M
      mm, memcg: use consistent gfp flags during readahead · 8a5c743e
      Michal Hocko committed
      Vladimir has noticed that we might declare memcg oom even during
      readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
      restriction) while __do_page_cache_readahead uses
      page_cache_alloc_readahead which adds __GFP_NORETRY to prevent
      OOMs.  This gfp mask discrepancy is really unfortunate and easily
      fixable.  Drop page_cache_alloc_readahead() which only has one user and
      outsource the gfp_mask logic into readahead_gfp_mask and propagate this
      mask from __do_page_cache_readahead down to read_pages.
      
      This alone would have only very limited impact as most filesystems are
      implementing ->readpages and the common implementation mpage_readpages
      does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
      use readahead_gfp_mask instead as this function is called only during
      readahead as well.  The same applies to read_cache_pages.
      
      ext4 has its own ext4_mpage_readpages but the path which has pages !=
      NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
      doing a very similar pattern to mpage_readpages so the same can be
      applied to them as well.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@suse.com: restrict gfp mask in mpage_alloc]
      Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Changman Lee <cm224.lee@samsung.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
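The mask discrepancy described in this commit can be illustrated with a small user-space sketch. Everything here is a stand-in: the `SKETCH_*` flag values, `struct mapping_sketch`, and the function names are hypothetical and only mirror the roles of GFP_KERNEL, __GFP_NORETRY, the mapping gfp restriction, and readahead_gfp_mask named in the message. It is not the kernel implementation.

```c
#include <stdint.h>

typedef uint32_t gfp_sketch_t;

/* Illustrative bit values only; the kernel's real gfp bits differ. */
#define SKETCH_GFP_KERNEL   ((gfp_sketch_t)0x01u)
#define SKETCH_GFP_NORETRY  ((gfp_sketch_t)0x10u)

/* Hypothetical stand-in for the per-file mapping and its gfp restriction. */
struct mapping_sketch {
	gfp_sketch_t gfp_mask;
};

/* Old read_pages behaviour: GFP_KERNEL restricted by the mapping mask. */
static inline gfp_sketch_t old_read_pages_mask(const struct mapping_sketch *m)
{
	return m->gfp_mask & SKETCH_GFP_KERNEL;
}

/* One helper computes the readahead mask so every caller agrees on it. */
static inline gfp_sketch_t readahead_mask_sketch(const struct mapping_sketch *m)
{
	return m->gfp_mask | SKETCH_GFP_NORETRY;
}

/* Callers receive the mask from above instead of recomputing their own. */
static inline gfp_sketch_t read_pages_sketch(gfp_sketch_t gfp_from_caller)
{
	return gfp_from_caller;
}
```

Passing the mask down as a parameter, rather than rebuilding it at each level, is what keeps the no-retry behaviour consistent along the whole readahead path.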
  2. 21 Jul, 2016 2 commits
    • C
      Btrfs: fix delalloc accounting after copy_from_user faults · 8b8b08cb
      Chris Mason committed
      Commit 56244ef1 was almost but not quite enough to fix the
      reservation math after btrfs_copy_from_user returned partial copies.
      
      Some users are still seeing warnings in btrfs_destroy_inode, and with a
      long enough test run I'm able to trigger them as well.
      
      This patch fixes the accounting math again, bringing it much closer to
      the way it was before the sectorsize conversion Chandan did.  The
      problem is accounting for the offset into the page/sector when we do a
      partial copy.  This one just uses the dirty_sectors variable which
      should already be updated properly.
      Signed-off-by: Chris Mason <clm@fb.com>
      cc: stable@vger.kernel.org # v4.6+
    • J
      Btrfs: avoid deadlocks during reservations in btrfs_truncate_block · bac357dc
      Josef Bacik committed
      The new enospc code makes it possible to deadlock if we don't use
      FLUSH_LIMIT during reservations inside a transaction.  This enforces
      the correct flush type to avoid both deadlocks and assertions.
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
  3. 19 Jul, 2016 1 commit
  4. 08 Jul, 2016 17 commits
    • J
      Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes · 8ca17f0f
      Josef Bacik committed
      We used to allow you to set FLUSH_ALL and then just wouldn't do things like
      commit transactions or wait on ordered extents if we noticed you were in a
      transaction.  However now that all the flushing for FLUSH_ALL is asynchronous
      we've lost the ability to tell, and we could end up deadlocking.  So instead use
      FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
      we error out to preserve the previous behavior.  I've also added an ASSERT() to
      catch anybody else who tries to do this.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: fill relocation block rsv after allocation · ac2fabac
      Josef Bacik committed
      Since we set the reloc control before we've reserved our space for relocation we
      could race with a root being dirtied and not actually have space to do our init
      reloc root.  So once we've allocated it and set it up go ahead and make our
      reservation before setting the relocate control, that way anybody who tries to
      do the reloc root init has space to use.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: always use trans->block_rsv for orphans · 40acc3ee
      Josef Bacik committed
      This is the case all the time anyway except for relocation which could be doing
      a reloc root for a non ref counted root, in which case we'd end up with some
      random block rsv rather than the one we have our reservation in.  If there isn't
      enough space in the block rsv we are trying to steal from we'll BUG() because we
      expect there to be space for the orphan to make its reservation.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: change how we calculate the global block rsv · ae2e4728
      Josef Bacik committed
      Traditionally we've calculated the global block rsv by guessing how much of the
      metadata used amount was the extent tree, and then taking the data size and
      figuring out how large the csum tree would have to be to hold that much data.
      
      This is imprecise and falls down on MIXED file systems as we can't trust the
      data used amount.  This resulted in failures for xfstests generic/333 because it
      creates lots of clones, which explodes out the extent tree.  Our global reserve
      calculations were woefully inaccurate in this case which meant we got into a
      situation where we did not have enough reserved to do our work.
      
      We know we only use the global block rsv for the extent, csum, and root trees,
      so just get the bytes used for these trees and use that as the basis of our
      global reserve.  Since these are not reference counted trees the bytes_used
      value will be accurate.  This fixed the transaction aborts seen with
      generic/333.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: use root when checking need_async_flush · 87241c2e
      Josef Bacik committed
      Instead of doing fs_info->fs_root in need_async_flush, which may not be set
      during recovery when mounting, just pass the root itself in, which makes more
      sense as that's what btrfs_calc_reclaim_metadata_size takes.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reported-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: don't bother kicking async if there's nothing to reclaim · d38b349c
      Josef Bacik committed
      We do this check when we start the async reclaimer thread, might as well check
      before we kick it off to save us some cycles.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: fix release reserved extents trace points · 31bada7c
      Josef Bacik committed
      We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
      isn't quite right because we will go through and free that extent later when we
      unpin, so it messes up apps that are accounting for the reservation space.  We
      were also unconditionally doing it in __btrfs_free_reserved_extent(), when it
      should only happen if we actually free the reservation instead of pinning
      the extent.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: add tracepoints for flush events · f376df2b
      Josef Bacik committed
      We want to track when we're triggering flushing from our reservation code and
      what flushing is being done when we start flushing.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: fix delalloc reservation amount tracepoint · f485c9ee
      Josef Bacik committed
      We can sometimes drop the reservation we had for our inode, so we need to remove
      that amount from to_reserve so that our tracepoint reports a valid amount of
      space.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: trace pinned extents · c51e7bb1
      Josef Bacik committed
      Pinned extents are an important metric to keep track of for enospc.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: introduce ticketed enospc infrastructure · 957780eb
      Josef Bacik committed
      Our enospc flushing sucks.  It is born from a time where we were early
      enospc'ing constantly because multiple threads would race in for the same
      reservation and randomly starve other ones out.  So I came up with this solution
      to block any other reservations from happening while one guy tried to flush
      stuff to satisfy his reservation.  This gives us pretty good correctness, but
      completely crap latency.
      
      The solution I've come up with is ticketed reservations.  Basically we try to
      make our reservation, and if we can't we put a ticket on a list in order and
      kick off an async flusher thread.  This async flusher thread does the same old
      flushing we always did, just asynchronously.  As space is freed and added back
      to the space_info it checks and sees if we have any tickets that need
      satisfying, and adds space to the tickets and wakes up anything we've satisfied.
      
      Once the flusher thread stops making progress it wakes up all the current
      tickets and tells them to take a hike.
      
      There is a priority list for things that can't flush, since the async flusher
      could do anything we need to avoid deadlocks.  These guys get priority for
      having their reservation made, and will still do manual flushing themselves in
      case the async flusher isn't running.
      
      This patch gives us significantly better latencies.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
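The ticket mechanism described in this commit can be sketched in user space as a FIFO queue that is refilled as space is freed. This is a simplified model under stated assumptions: `struct ticket_sketch`, `struct space_info_sketch`, the fixed-size queue, and both function names are hypothetical, and the sketch omits the async flusher thread, the priority list, and all locking.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_TICKETS 16

struct ticket_sketch {
	uint64_t bytes;     /* bytes this reservation still needs */
	bool satisfied;     /* set once the full amount has been granted */
};

struct space_info_sketch {
	uint64_t free_bytes;
	struct ticket_sketch *queue[SKETCH_MAX_TICKETS]; /* FIFO of waiters */
	size_t head;
	size_t count;
};

/* Returns 0 if reserved immediately, 1 if a ticket was queued, -1 if full. */
static inline int reserve_or_queue(struct space_info_sketch *si,
				   struct ticket_sketch *t, uint64_t bytes)
{
	t->bytes = bytes;
	t->satisfied = false;
	/* Only bypass the queue when nobody is already waiting. */
	if (si->count == 0 && si->free_bytes >= bytes) {
		si->free_bytes -= bytes;
		t->bytes = 0;
		t->satisfied = true;
		return 0;
	}
	if (si->count == SKETCH_MAX_TICKETS)
		return -1;
	si->queue[(si->head + si->count) % SKETCH_MAX_TICKETS] = t;
	si->count++;
	return 1;
}

/* Called as flushing frees space: hand it to tickets strictly in order. */
static inline void add_freed_space(struct space_info_sketch *si, uint64_t bytes)
{
	si->free_bytes += bytes;
	while (si->count > 0) {
		struct ticket_sketch *t = si->queue[si->head];
		uint64_t give = t->bytes < si->free_bytes ? t->bytes : si->free_bytes;

		t->bytes -= give;
		si->free_bytes -= give;
		if (t->bytes > 0)
			break;            /* head ticket still short; others wait */
		t->satisfied = true;      /* a real implementation would wake the waiter */
		si->head = (si->head + 1) % SKETCH_MAX_TICKETS;
		si->count--;
	}
}
```

Serving tickets in arrival order is what prevents the starvation described above: a small later request cannot jump ahead of an earlier, larger one.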
    • J
      Btrfs: add tracepoint for adding block groups · c83f8eff
      Josef Bacik committed
      I'm writing a tool to visualize the enospc system inside btrfs, and I need this
      tracepoint in order to keep track of the block groups in the system.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: warn_on for unaccounted spaces · d555b6c3
      Josef Bacik committed
      These were hidden behind enospc_debug, which isn't helpful as they indicate
      actual bugs, unlike the rest of the enospc_debug stuff which is really debug
      information.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: change delayed reservation fallback behavior · c48f49d6
      Josef Bacik committed
      We reserve space for the inode update when we first reserve space for writing to
      a file.  However there are lots of ways that we can use this reservation and not
      have it for subsequent ordered extents.  Previously we'd fall through and try to
      reserve metadata bytes for this, then we'd just steal the full reservation from
      the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
      reservation from the global reserve.  The problem with this is we can easily
      just return ENOSPC and fall back to updating the inode item directly.  In the
      worst case (assuming 4k nodesize) we'd steal 64KiB from the global reserve if we
      fall all the way through, however if we just fall back and update the inode
      directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32KiB.
      
      We would have also just added the extent item for the inode so we likely will
      have already cow'ed down most of the way to the leaf containing the inode item,
      so more often than not we only need one or two nodesize's worth of
      reservations.  Given the reservation for the extent itself is also a worst case
      we will likely already have space to cover the inode update.
      
      This change will make us behave better in the theoretical worst case, and much
      better in the case that we don't have our reservation and cannot reserve more
      metadata.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
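The two worst-case figures quoted in this commit can be checked with a short calculation. The sketch below assumes a tree depth limit of 8, which is the value implied by "4k * BTRFS_PATH_MAX ... 32KiB" in the message, and assumes the full reservation is twice that per item, which matches the 64KiB figure. The constant and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed tree depth limit, inferred from the 32KiB figure in the message. */
#define SKETCH_MAX_LEVEL 8u

/* Full reservation: one node per level, doubled (assumed factor). */
static inline uint64_t full_reservation_bytes(uint64_t nodesize, unsigned int items)
{
	return nodesize * SKETCH_MAX_LEVEL * 2u * items;
}

/* Direct inode update: at most one node per level along a single path. */
static inline uint64_t direct_update_bytes(uint64_t nodesize)
{
	return nodesize * SKETCH_MAX_LEVEL;
}
```

With a 4096-byte nodesize the first function gives 65536 bytes and the second gives 32768 bytes, so the fallback path halves the worst-case draw on the global reserve.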
    • J
      Btrfs: always reserve metadata for delalloc extents · 48c3d480
      Josef Bacik committed
      There are a few races in the metadata reservation stuff.  First we add the bytes
      to the block_rsv well after we've set the bit on the inode saying that we have
      space for it and after we've reserved the bytes.  So use the normal
      btrfs_block_rsv_add helper for this case.  Secondly we can flush delalloc
      extents when we try to reserve space for our write, which means that we could
      have used up the space for the inode and we wouldn't know because we only check
      before the reservation.  So instead make sure we are always reserving space for
      the inode update, and then if we don't need it release those bytes afterward.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Josef Bacik committed
      So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
      Not only this but it unconditionally changes the size of the block_rsv.  This
      isn't a bug strictly speaking, but it makes truncate block rsv's look funny
      because every time we migrate bytes over its size grows, even though we only
      want it to be a specific size.  So collapse this into one function that takes an
      update_size argument and make truncate and evict not update the size for
      consistency's sake.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
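A minimal sketch of a migrate helper with an update_size argument, as described in this commit, might look like the following. The struct layout, the function name, and the -1 error value are assumptions for illustration and do not reproduce the kernel's btrfs_block_rsv code.

```c
#include <stdint.h>

struct block_rsv_sketch {
	uint64_t size;      /* intended size of the reserve */
	uint64_t reserved;  /* bytes currently held */
};

/*
 * Move num_bytes from src to dst. dst->size grows only when update_size
 * is non-zero. Returns 0 on success, -1 if src holds too few bytes.
 */
static inline int block_rsv_migrate_sketch(struct block_rsv_sketch *src,
					   struct block_rsv_sketch *dst,
					   uint64_t num_bytes, int update_size)
{
	if (src->reserved < num_bytes)
		return -1;
	src->reserved -= num_bytes;
	dst->reserved += num_bytes;
	if (update_size)
		dst->size += num_bytes;
	return 0;
}
```

A fixed-size reserve such as the truncate one would pass 0 for update_size, so repeated migrations refill it without inflating its recorded size.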
    • J
      Btrfs: add bytes_readonly to the spaceinfo at once · e40edf2d
      Josef Bacik committed
      For some reason we're adding bytes_readonly to the space info after we update
      the space info with the block group info.  This creates a tiny race where we
      could over-reserve space because we haven't yet taken out the bytes_readonly
      bit.  Since we already know this information at the time we call
      update_space_info, just pass it along so it can be updated all at once.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 25 Jun, 2016 1 commit
    • O
      Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99
      Omar Sandoval committed
      Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
      backed out the conversion to ->iterate_shared() for Btrfs because the
      delayed inode handling in btrfs_real_readdir() is racy. However, we can
      still do readdir in parallel if there are no delayed nodes.
      
      This is a temporary fix which upgrades the shared inode lock to an
      exclusive lock only when we have delayed items until we come up with a
      more complete solution. While we're here, rename the
      btrfs_{get,put}_delayed_items functions to make it very clear that
      they're just for readdir.
      
      Tested with xfstests and by doing a parallel kernel build:
      
      	while make tinyconfig && make -j4 && git clean -dqfx; do
      		:
      	done
      
      along with a bunch of parallel finds in another shell:
      
      	while true; do
      		for ((i=0; i<4; i++)); do
      			find . >/dev/null &
      		done
      		wait
      	done
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  6. 24 Jun, 2016 4 commits
    • C
      Btrfs: Force stripesize to the value of sectorsize · b7f67055
      Chandan Rajendra committed
      Btrfs code currently assumes stripesize to be the same as
      sectorsize. However Btrfs-progs (until commit
      df05c7ed455f519e6e15e46196392e4757257305) has been setting
      btrfs_super_block->stripesize to a value of 4096.
      
      This commit makes sure that the value of btrfs_super_block->stripesize
      is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
      to sectorsize.
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
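The two steps in this commit, validating that the superblock value is a power of 2 and then using sectorsize regardless, can be sketched as follows. The function names and the -1 error value are hypothetical; the bit trick is the standard power-of-two test.

```c
#include <stdbool.h>
#include <stdint.h>

/* A non-zero value is a power of 2 exactly when it has a single bit set. */
static inline bool is_power_of_2_sketch(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Validate the on-disk stripesize, then ignore it in favour of sectorsize.
 * Returns 0 and stores the effective value in *out, or -1 if the on-disk
 * value is not a power of 2.
 */
static inline int effective_stripesize_sketch(uint32_t sb_stripesize,
					      uint32_t sectorsize, uint32_t *out)
{
	if (!is_power_of_2_sketch(sb_stripesize))
		return -1;
	*out = sectorsize;
	return 0;
}
```

Keeping the validation while overriding the value means a corrupt superblock is still rejected, yet older filesystems that recorded 4096 continue to mount with a larger sectorsize.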
    • W
      btrfs: fix disk_i_size update bug when fallocate() fails · c0d2f610
      Wang Xiaoguang committed
      When doing a truncate operation, btrfs_setsize() will first call
      truncate_setsize() to set new inode->i_size, but if later
      btrfs_truncate() fails, btrfs_setsize() will call
      "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
      in-memory inode size, and the bug occurs. That is because, in the truncate
      case, btrfs_ordered_update_i_size() directly uses inode->i_size
      to update BTRFS_I(inode)->disk_i_size, whereas we should use the
      "offset" argument to update disk_i_size. Here is the call graph:
      ==>btrfs_truncate()
      ====>btrfs_truncate_inode_items()
      ======>btrfs_ordered_update_i_size(inode, last_size, NULL);
      Here btrfs_ordered_update_i_size()'s offset argument is last_size.
      
      The following test case reproduces this bug:
      
      dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
      dev=$(losetup --show -f fs.img)
      mkdir -p /mnt/mntpoint
      mkfs.btrfs  -f $dev
      mount $dev /mnt/mntpoint
      cd /mnt/mntpoint
      
      echo "workdir is: /mnt/mntpoint"
      blocksize=$((128 * 1024))
      dd if=/dev/zero of=testfile bs=$blocksize count=1
      sync
      count=$((17*1024*1024*1024/blocksize))
      echo "file size is:" $((count*blocksize))
      for ((i = 1; i <= $count; i++)); do
      	i=$((i + 1))
      	dst_offset=$((blocksize * i))
      	xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
      		testfile > /dev/null
      done
      sync
      
      truncate --size 0 testfile
      ls -l testfile
      du -sh testfile
      exit
      
      In this case, the truncate operation fails with ENOSPC and
      "du -sh testfile" returns a value greater than 0 while testfile's
      size is 0; we need to reflect the correct inode->i_size.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • L
      Btrfs: fix error handling in map_private_extent_buffer · 415b35a5
      Liu Bo committed
      map_private_extent_buffer() can return -EINVAL in two different cases,
      1. when the requested contents span two pages if nodesize is larger
         than pagesize,
      2. when it detects something insane.
      
      The 2nd one used to be only a WARN_ON(1), and we decided to return an error
      to callers, but we didn't fix up all its callers, which will be
      addressed by this patch.
      
      Without this, btrfs may end up with a 'general protection' fault, i.e.
      reading invalid memory.
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • W
      Btrfs: fix error return code in btrfs_init_test_fs() · 04e1b65a
      Wei Yongjun committed
      Fix to return a negative error code from the kern_mount() error handling
      case instead of 0 (ret is set to 0 by register_filesystem), as done
      elsewhere in this function.
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
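The bug pattern fixed here, a stale ret of 0 left over from an earlier successful call, can be shown with a user-space mock. All names below are hypothetical stand-ins for register_filesystem() and kern_mount(), the error value is an arbitrary placeholder, and a NULL return stands in for the kernel's error-pointer convention.

```c
#include <stddef.h>

#define SKETCH_ERR 12            /* placeholder error number */

static int mount_should_fail;    /* knob to simulate a mount failure */
static int dummy_mount;          /* stand-in for a mounted filesystem */

static inline int register_fs_sketch(void)
{
	return 0;                /* success leaves the caller's ret at 0 */
}

static inline void *kern_mount_sketch(void)
{
	return mount_should_fail ? NULL : (void *)&dummy_mount;
}

static inline int init_test_fs_sketch(void)
{
	int ret = register_fs_sketch();

	if (ret)
		return ret;
	if (kern_mount_sketch() == NULL) {
		/* Without this assignment, ret would still be 0 here. */
		ret = -SKETCH_ERR;
		return ret;
	}
	return 0;
}
```

The point of the sketch is only the assignment before the error return: reusing one ret variable across several calls requires setting it explicitly on every failure path.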
  7. 23 Jun, 2016 3 commits
    • J
      Btrfs: don't do nocow check unless we have to · c6887cd1
      Josef Bacik committed
      Before we write into prealloc/nocow space we have to make sure that there are no
      references to the extents we are writing into, which means checking the extent
      tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
      unless we can't reserve data space, since it's a serious drag on performance.
      With the following sequence
      
      fallocate -l10737418240 /mnt/btrfs-test/file
      cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
      fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
      	--end_fsync=1
      
      we get the worst case scenario where we have to fall back on to doing the check
      anyway.
      
      Without this patch
      lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
      write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec
      
      With this patch
      lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
      write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec
      
      We get twice the throughput, half of the runtime, and half of the average
      latency.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      [ PAGE_CACHE_ removal related fixups ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
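The ordering change in this commit, trying the cheap data reservation first and running the expensive nocow check only when that fails, can be sketched as below. The context struct, its counter, and both function names are hypothetical; the real check walks the extent and csum trees, which this model replaces with a stored boolean.

```c
#include <stdbool.h>
#include <stdint.h>

struct write_ctx_sketch {
	uint64_t free_data_bytes;   /* space available for a normal reservation */
	bool extent_is_shared;      /* what the expensive lookup would find */
	unsigned int nocow_checks;  /* how many expensive lookups were run */
};

/* Stand-in for the extent/csum tree lookup; counts each invocation. */
static inline bool can_nocow_sketch(struct write_ctx_sketch *c)
{
	c->nocow_checks++;
	return !c->extent_is_shared;
}

/* Returns 0 if the write may proceed, -1 if there is no space for it. */
static inline int prepare_write_sketch(struct write_ctx_sketch *c, uint64_t len)
{
	if (c->free_data_bytes >= len) {
		c->free_data_bytes -= len;  /* cheap path: no lookup needed */
		return 0;
	}
	/* Reservation failed, so only now pay for the nocow check. */
	return can_nocow_sketch(c) ? 0 : -1;
}
```

In this model a write that can reserve space never touches the lookup counter, which is the source of the latency improvement reported above.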
    • C
      btrfs: fix deadlock in delayed_ref_async_start · 0f873eca
      Chris Mason committed
      "Btrfs: track transid for delayed ref flushing" was deadlocking on
      btrfs_attach_transaction because it's not safe to call from the async
      delayed ref start code.  This commit brings back btrfs_join_transaction
      instead and checks for a blocked commit.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • J
      Btrfs: track transid for delayed ref flushing · 31b9655f
      Josef Bacik committed
      Using the offwakecputime bpf script I noticed most of our time was spent waiting
      on the delayed ref throttling.  This is what is supposed to happen, but
      sometimes the transaction can commit and then we're waiting for throttling that
      doesn't matter anymore.  So change this stuff to be a little smarter by tracking
      the transid we were in when we initiated the throttling.  If the transaction we
      get is different then we can just bail out.  This resulted in a 50% speedup in
      my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
      over the entire run (which is about 30 minutes).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  8. 18 Jun, 2016 8 commits
  9. 08 Jun, 2016 3 commits