1. 11 3月, 2014 40 次提交
    • Q
      btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue · afe3d242
      Qu Wenruo 提交于
      Much like the fs_info->workers, replace the fs_info->delalloc_workers
      use the same btrfs_workqueue.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      afe3d242
    • Q
      btrfs: Replace fs_info->workers with btrfs_workqueue. · 5cdc7ad3
      Qu Wenruo 提交于
      Use the newly created btrfs_workqueue_struct to replace the original
      fs_info->workers
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      5cdc7ad3
    • Q
      btrfs: Add threshold workqueue based on kernel workqueue · 0bd9289c
      Qu Wenruo 提交于
      The original btrfs_workers has thresholding functions to dynamically
      create or destroy kthreads.
      
      Though there is no such function in kernel workqueue because the worker
      is not created manually, we can still use the workqueue_set_max_active
      to simulated the behavior, mainly to achieve a better HDD performance by
      setting a high threshold on submit_workers.
      (Sadly, no resource can be saved)
      
      So in this patch, extra workqueue pending counters are introduced to
      dynamically change the max active of each btrfs_workqueue_struct, hoping
      to restore the behavior of the original thresholding function.
      
      Also, workqueue_set_max_active use a mutex to protect workqueue_struct,
      which is not meant to be called too frequently, so a new interval
      mechanism is applied, that will only call workqueue_set_max_active after
      a count of work is queued. Hoping to balance both the random and
      sequence performance on HDD.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      0bd9289c
    • Q
      btrfs: Add high priority workqueue support for btrfs_workqueue_struct · 1ca08976
      Qu Wenruo 提交于
      Add high priority function to btrfs_workqueue.
      
      This is implemented by embedding a btrfs_workqueue into a
      btrfs_workqueue and use some helper functions to differ the normal
      priority wq and high priority wq.
      So the high priority wq is completely independent from the normal
      workqueue.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      1ca08976
    • Q
      btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue · 08a9ff32
      Qu Wenruo 提交于
      Use kernel workqueue to implement a new btrfs_workqueue_struct, which
      has the ordering execution feature like the btrfs_worker.
      
      The func is executed in a concurrency way, and the
      ordred_func/ordered_free is executed in the sequence them are queued
      after the corresponding func is done.
      
      The new btrfs_workqueue works much like the original one, one workqueue
      for normal work and a list for ordered work.
      When a work is queued, ordered work will be added to the list and helper
      function will be queued into the workqueue.
      The helper function will execute a normal work and then check and execute as many
      ordered work as possible in the sequence they were queued.
      
      At this patch, high priority work queue or thresholding is not added yet.
      The high priority feature and thresholding will be added in the following patches.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      08a9ff32
    • Q
      btrfs: Cleanup the unused struct async_sched. · f5961d41
      Qu Wenruo 提交于
      The struct async_sched is not used by any codes and can be removed.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NJosef Bacik <jbacik@fusionio.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      f5961d41
    • L
      Btrfs: skip search tree for REG files · 644d1940
      Liu Bo 提交于
      It is really unnecessary to search tree again for @gen, @mode and @rdev
      in the case of REG inodes' creation, as we've got btrfs_inode_item in sctx,
      and @gen, @mode and @rdev can easily be fetched.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      644d1940
    • M
      Btrfs: fix preallocate vs double nocow write · 7b2b7085
      Miao Xie 提交于
      We can not release the reserved metadata space for the first write if we
      find the write position is pre-allocated. Because the kernel might write
      the data on the disk before we do the second write but after the can-nocow
      check, if we release the space for the first write, we might fail to update
      the metadata because of no space.
      
      Fix this problem by end nocow write if there is dirty data in the range whose
      space is pre-allocated.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      7b2b7085
    • M
      Btrfs: fix wrong lock range and write size in check_can_nocow() · c933956d
      Miao Xie 提交于
      The write range may not be sector-aligned, for example:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|--------|  <- correct lock range, size: 3blocks
      
      But according to the old code, we used the size of write range to calculate
      the lock range directly, not considered the offset, we would get a wrong lock
      range:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|		<- wrong lock range, size: 2blocks
      
      And besides that, the old code also had the same problem when calculating
      the real write size. Correct them.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      c933956d
    • D
      9c9ca00b
    • D
      btrfs: send: fix old buffer length in fs_path_ensure_buf · 1b2782c8
      David Sterba 提交于
      In "btrfs: send: lower memory requirements in common case" the code to
      save the old_buf_len was incorrectly moved to a wrong place and broke
      the original logic.
      Reported-by: NFilipe David Manana <fdmanana@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Reviewed-by: NFilipe David Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      1b2782c8
    • F
      Btrfs: more efficient btrfs_drop_extent_cache · 176840b3
      Filipe Manana 提交于
      While droping extent map structures from the extent cache that cover our
      target range, we would remove each extent map structure from the red black
      tree and then add either 1 or 2 new extent map structures if the former
      extent map covered sections outside our target range.
      
      This change simply attempts to replace the existing extent map structure
      with a new one that covers the subsection we're not interested in, instead
      of doing a red black remove operation followed by an insertion operation.
      
      The number of elements in an inode's extent map tree can get very high for large
      files under random writes. For example, while running the following test:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G \
              --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
              --max-requests=500000 --file-rw-ratio=2 [prepare|run]
      
      I captured the following histogram capturing the number of extent_map items
      in the red black tree while that test was running:
      
          Count: 122462
          Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
          Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
             1.000 -    5.231:   452 |
             5.231 -  187.392:    87 |
           187.392 -  585.911:   206 |
           585.911 - 1827.438:   623 |
          1827.438 - 5695.245:  1962 #
          5695.245 - 17744.861:  6204 ####
         17744.861 - 55283.764: 21115 ############
         55283.764 - 172231.000: 91813 #####################################################
      
      Benchmark:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
              --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
              --file-io-mode=sync --file-fsync-freq=0 [prepare|run]
      
      Before this change: 122.1Mb/sec
      After this change:  125.07Mb/sec
      (averages of 5 test runs)
      
      Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      176840b3
    • F
      Btrfs: more efficient split extent state insertion · f2071b21
      Filipe Manana 提交于
      When we split an extent state there's no need to start the rbtree search
      from the root node - we can start it from the original extent state node,
      since we would end up in its subtree if we do the search starting at the
      root node anyway.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      f2071b21
    • F
      Btrfs: remove unneeded field / smaller extent_map structure · cbc0e928
      Filipe Manana 提交于
      We don't need to have an unsigned int field in the extent_map struct
      to tell us whether the extent map is in the inode's extent_map tree or
      not. We can use the rb_node struct field and the RB_CLEAR_NODE and
      RB_EMPTY_NODE macros to achieve the same task.
      
      This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
      64 bits system).
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      cbc0e928
    • W
      Btrfs: skip locking when searching commit root · e84752d4
      Wang Shilong 提交于
      We won't change commit root, skip locking dance with commit root
      when walking backrefs, this can speed up btrfs send operations.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      e84752d4
    • W
      Btrfs: wake up @scrub_pause_wait as much as we can · 32a44789
      Wang Shilong 提交于
      check if @scrubs_running=@scrubs_paused condition inside wait_event()
      is not an atomic operation which means we may inc/dec @scrub_running/
      paused at any time. Let's wake up @scrub_pause_wait as much as we can
      to let commit transaction blocked less.
      
      An example below:
      
      Thread1				Thread2
      |->scrub_blocked_if_needed()	|->scrub_pending_trans_workers_inc
        |->increase @scrub_paused
                                             |->increase @scrub_running
        |->wake up scrub_pause_wait list
                                             |->scrub blocked
                                             |->increase @scrub_paused
      
      Thread3 is commiting transaction which is blocked at btrfs_scrub_pause().
      So after Thread2 increase @scrub_paused, we meet the condition
      @scrub_paused=@scrub_running, but transaction will be still blocked until
      another calling to wake up @scrub_pause_wait.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      32a44789
    • W
      Btrfs: cancel scrub on transaction abortion · c0af8f0b
      Wang Shilong 提交于
      If we fail to commit transaction, we'd better
      cancel scrub operations.
      Suggested-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      c0af8f0b
    • W
      Btrfs: device_replace: fix deadlock for nocow case · 12cf9372
      Wang Shilong 提交于
      commit cb7ab021 cause a following deadlock found by
      xfstests,btrfs/011:
      
      Thread1 is commiting transaction which is blocked at
      btrfs_scrub_pause().
      
      Thread2 is calling btrfs_file_aio_write() which has held
      inode's @i_mutex and commit transaction(blocked because
      Thread1 is committing transaction).
      
      Thread3 is copy_nocow_page worker which will also try to
      hold inode @i_mutex, so thread3 will wait Thread1 finished.
      
      Thread4 is waiting pending workers finished which will wait
      Thread3 finished. So the problem is like this:
      
      Thread1--->Thread4--->Thread3--->Thread2---->Thread1
      
      Deadlock happens! we fix it by letting Thread1 go firstly,
      which means we won't block transaction commit while we are
      waiting pending workers finished.
      Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      12cf9372
    • W
      Btrfs: fix a possible deadlock between scrub and transaction committing · 6cf7f77e
      Wang Shilong 提交于
      btrfs_scrub_continue() will be called when cleaning up transaction.However,
      this can only be called if btrfs_scrub_pause() is called before.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      6cf7f77e
    • S
      btrfs: Use PTR_ERR_OR_ZERO · 886322e8
      Sachin Kamat 提交于
      PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
      also include missing err.h header.
      Signed-off-by: NSachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      886322e8
    • F
      Btrfs: fix send issuing outdated paths for utimes, chown and chmod · bf0d1f44
      Filipe Manana 提交于
      When doing an incremental send, if we had a directory pending a move/rename
      operation and none of its parents, except for the immediate parent, were
      pending a move/rename, after processing the directory's references, we would
      be issuing utimes, chown and chmod intructions against am outdated path - a
      path which matched the one in the parent root.
      
      This change also simplifies a bit the code that deals with building a path
      for a directory which has a move/rename operation delayed.
      
      Steps to reproduce:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c/d/e
          $ mkdir /mnt/btrfs/a/b/c/f
          $ chmod 0777 /mnt/btrfs/a/b/c/d/e
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2
          $ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2
          $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2
          $ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2
          $ chmod 0700 /mnt/btrfs/a/b/f2/e2
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: chmod a/b/c/d/e failed. No such file or directory
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      bf0d1f44
    • F
      Btrfs: correctly determine if blocks are shared in btrfs_compare_trees · 6baa4293
      Filipe Manana 提交于
      Just comparing the pointers (logical disk addresses) of the btree nodes is
      not completely bullet proof, we have to check if their generation numbers
      match too.
      
      It is guaranteed that a COW operation will result in a block with a different
      logical disk address than the original block's address, but over time we can
      reuse that former logical disk address.
      
      For example, creating a 2Gb filesystem on a loop device, and having a script
      running in a loop always updating the access timestamp of a file, resulted in
      the same logical disk address being reused for the same fs btree block in about
      only 4 minutes.
      
      This could make us skip entire subtrees when doing an incremental send (which
      is currently the only user of btrfs_compare_trees). However the odds of getting
      2 blocks at the same tree level, with the same logical disk address, equal first
      slot keys and different generations, should hopefully be very low.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      6baa4293
    • F
      Btrfs: fix send attempting to rmdir non-empty directories · 9dc44214
      Filipe Manana 提交于
      The incremental send algorithm assumed that it was possible to issue
      a directory remove (rmdir) if the the inode number it was currently
      processing was greater than (or equal) to any inode that referenced
      the directory's inode. This wasn't a valid assumption because any such
      inode might be a child directory that is pending a move/rename operation,
      because it was moved into a directory that has a higher inode number and
      was moved/renamed too - in other words, the case the following commit
      addressed:
      
          9f03740a
          (Btrfs: fix infinite path build loops in incremental send)
      
      This made an incremental send issue an rmdir operation before the
      target directory was actually empty, which made btrfs receive fail.
      Therefore it needs to wait for all pending child directory inodes to
      be moved/renamed before sending an rmdir operation.
      
      Simple steps to reproduce this issue:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c/x
          $ mkdir /mnt/btrfs/a/b/y
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY
          $ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY
          $ rmdir /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: rmdir o259-6-0 failed. Directory not empty
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      9dc44214
    • F
      Btrfs: send, don't send rmdir for same target multiple times · 29d6d30f
      Filipe Manana 提交于
      When doing an incremental send, if we delete a directory that has N > 1
      hardlinks for the same file and that file has the highest inode number
      inside the directory contents, an incremental send would send N times an
      rmdir operation against the directory. This made the btrfs receive command
      fail on the second rmdir instruction, as the target directory didn't exist
      anymore.
      
      Steps to reproduce the issue:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c
          $ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt
          $ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ rm -f /mnt/btrfs/a/b/c/foo.txt
          $ rm -f /mnt/btrfs/a/b/c/bar.txt
          $ rmdir /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: rmdir o259-6-0 failed. No such file or directory
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      29d6d30f
    • F
      Btrfs: incremental send, fix invalid path after dir rename · 2b863a13
      Filipe Manana 提交于
      This fixes yet one more case not caught by the commit titled:
      
         Btrfs: fix infinite path build loops in incremental send
      
      In this case, even before the initial full send, we have a directory
      which is a child of a directory with a higher inode number. Then we
      perform the initial send, and after we rename both the child and the
      parent, without moving them around. After doing these 2 renames, an
      incremental send sent a rename instruction for the child directory
      which contained an invalid "from" path (referenced the parent's old
      name, not the new one), which made the btrfs receive command fail.
      
      Steps to reproduce:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b
          $ mkdir /mnt/btrfs/d
          $ mkdir /mnt/btrfs/a/b/c
          $ mv /mnt/btrfs/d /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x
          $ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umout /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
        "ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory"
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      2b863a13
    • F
      Btrfs: don't insert useless holes when punching beyond the inode's size · 12870f1c
      Filipe Manana 提交于
      If we punch beyond the size of an inode, we'll correctly remove any prealloc extents,
      but we'll also insert file extent items representing holes (disk bytenr == 0) that start
      with a key offset that lies beyond the inode's size and are not contiguous with the last
      file extent item.
      
      Example:
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
      btrfs-debug-tree output:
      
        item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
      	inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1
        item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
      	inode ref index 2 namelen 3 name: foo
        item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 90112 ram 122880
      	extent compression 0
        item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53
      	extent data disk byte 12845056 nr 4096 gen 6
      	extent data offset 0 nr 45056 ram 45056
      	extent compression 2
        item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 860160 ram 860160
      	extent compression 0
      
      The last extent item, which represents a hole, is useless as it lies beyond the inode's
      size.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      12870f1c
    • F
      Btrfs: cleanup delayed-ref.c:find_ref_head() · 85fdfdf6
      Filipe Manana 提交于
      The argument last wasn't used, all callers supplied a NULL value
      for it. Also removed unnecessary intermediate storage of the result
      of key comparisons.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      85fdfdf6
    • F
      Btrfs: remove unnecessary ref heads rb tree search · 6103fb43
      Filipe Manana 提交于
      When we didn't find the exact ref head we were looking for, if
      return_bigger != 0 we set a new search key to match either the
      next node after the last one we found or the first one in the
      ref heads rb tree, and then did another full tree search. For both
      cases this ended up being pointless as we would end up returning
      an entry we already had before repeating the search.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      6103fb43
    • J
      btrfs: wake up transaction thread upon remount · 2c6a92b0
      Justin Maggard 提交于
      Now that we can adjust the commit interval with a remount, we need
      to wake up the transaction thread or else he will continue to sleep
      until the previous transaction interval has elapsed before waking
      up.  So, if we go from a large commit interval to something smaller,
      the transaction thread will not wake up until the large interval has
      expired.  This also causes the cleaner thread to stay sleeping, since
      it gets woken up by the transaction thread.
      
      Fix it by simply waking up the transaction thread during a remount.
      Signed-off-by: NJustin Maggard <jmaggard10@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      2c6a92b0
    • M
      Btrfs: stop joining the log transaction if sync log fails · 50471a38
      Miao Xie 提交于
      If the log sync fails, there is something wrong in the log tree, we
      should not continue to join the log transaction and log the metadata.
      What we should do is to do a full commit.
      
      This patch fixes this problem by setting ->last_trans_log_full_commit
      to the current transaction id, it will tell the tasks not to join
      the log transaction, and do a full commit.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      50471a38
    • M
      Btrfs: just wait or commit our own log sub-transaction · d1433deb
      Miao Xie 提交于
      We might commit the log sub-transaction which didn't contain the metadata we
      logged. It was because we didn't record the log transid and just select
      the current log sub-transaction to commit, but the right one might be
      committed by the other task already. Actually, we needn't do anything
      and it is safe that we go back directly in this case.
      
      This patch improves the log sync by the above idea. We record the transid
      of the log sub-transaction in which we log the metadata, and the transid
      of the log sub-transaction we have committed. If the committed transid
      is >= the transid we record when logging the metadata, we just go back.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      d1433deb
    • M
      Btrfs: fix skipped error handle when log sync failed · 8b050d35
      Miao Xie 提交于
      It is possible that many tasks sync the log tree at the same time, but
      only one task can do the sync work, the others will wait for it. But those
      wait tasks didn't get the result of the log sync, and returned 0 when they
      ended the wait. It caused those tasks skipped the error handle, and the
      serious problem was they told the users the file sync succeeded but in
      fact they failed.
      
      This patch fixes this problem by introducing a log context structure,
      we insert it into the a global list. When the sync fails, we will set
      the error number of every log context in the list, then the waiting tasks
      get the error number of the log context and handle the error if need.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      8b050d35
    • M
      Btrfs: use signed integer instead of unsigned long integer for log transid · bb14a59b
      Miao Xie 提交于
      The log trans id is initialized to be 0 every time we create a log tree,
      and the log tree need be re-created after a new transaction is started,
      it means the log trans id is unlikely to be a huge number, so we can use
      signed integer instead of unsigned long integer to save a bit space.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      bb14a59b
    • M
      Btrfs: remove unnecessary memory barrier in btrfs_sync_log() · 7483e1a4
      Miao Xie 提交于
      Mutex unlock implies certain memory barriers to make sure all the memory
      operation completes before the unlock, and the next mutex lock implies memory
      barriers to make sure the all the memory happens after the lock. So it is
      a full memory barrier(smp_mb), we needn't add memory barriers. Remove them.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      7483e1a4
    • M
      Btrfs: don't start the log transaction if the log tree init fails · e87ac136
      Miao Xie 提交于
      The old code would start the log transaction even the log tree init
      failed, it was unnecessary. Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      e87ac136
    • M
      Btrfs: fix the skipped transaction commit during the file sync · 48cab2e0
      Miao Xie 提交于
      We may abort the wait earlier if ->last_trans_log_full_commit was set to
      the current transaction id, at this case, we need commit the current
      transaction instead of the log sub-transaction. But the current code
      didn't tell the caller to do it (return 0, not -EAGAIN). Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      48cab2e0
    • M
      Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ->last_trans_log_full_commit · 5c902ba6
      Miao Xie 提交于
      ->last_trans_log_full_commit may be changed by the other tasks without lock,
      so we need prevent the compiler from the optimize access just like
      	tmp = fs_info->last_trans_log_full_commit
      	if (tmp == ...)
      		...
      
      	<do something>
      
      	if (tmp == ...)
      		...
      
      In fact, we need get the new value of ->last_trans_log_full_commit during
      the second access. Fix it by ACCESS_ONCE().
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      5c902ba6
    • L
      Btrfs: avoid warning bomb of btrfs_invalidate_inodes · 7813b3db
      Liu Bo 提交于
      So after transaction is aborted, we need to cleanup inode resources by
      calling btrfs_invalidate_inodes(), and btrfs_invalidate_inodes() hopes
      roots' refs to be zero in old times and sets a WARN_ON(), however, this
      is not always true within cleaning up transaction, so we get to detect
      transaction abortion and not warn at all.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      7813b3db
    • L
      Btrfs: fix possible deadlock in btrfs_cleanup_transaction · 2a85d9ca
      Liu Bo 提交于
      [13654.480669] ======================================================
      [13654.480905] [ INFO: possible circular locking dependency detected ]
      [13654.481003] 3.12.0+ #4 Tainted: G        W  O
      [13654.481060] -------------------------------------------------------
      [13654.481060] btrfs-transacti/9347 is trying to acquire lock:
      [13654.481060]  (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
      [13654.481060] but task is already holding lock:
      [13654.481060]  (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs]
      [13654.481060] which lock already depends on the new lock.
      
      [13654.481060] the existing dependency chain (in reverse order) is:
      [13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}:
      [13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
      [13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
      [13654.481060]        [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs]
      [13654.481060]        [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs]
      [13654.481060]        [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs]
      [13654.481060]        [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs]
      [13654.481060]        [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs]
      [13654.481060]        [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs]
      [13654.481060]        [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs]
      [13654.481060]        [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs]
      [13654.481060]        [<ffffffff8114be91>] do_writepages+0x21/0x50
      [13654.481060]        [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60
      [13654.481060]        [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20
      [13654.481060]        [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs]
      [13654.481060]        [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs]
      [13654.481060]        [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs]
      [13654.481060]        [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs]
      [13654.481060]        [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs]
      [13654.481060]        [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs]
      [13654.481060]        [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs]
      [13654.481060]        [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs]
      [13654.481060]        [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs]
      [13654.481060]        [<ffffffff811b2409>] mount_fs+0x39/0x1b0
      [13654.481060]        [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0
      [13654.481060]        [<ffffffff811cfcce>] do_mount+0x23e/0xa90
      [13654.481060]        [<ffffffff811d05a3>] SyS_mount+0x83/0xc0
      [13654.481060]        [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b
      [13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}:
      [13654.481060]        [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70
      [13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
      [13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
      [13654.481060]        [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
      [13654.481060]        [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs]
      [13654.481060]        [<ffffffff81079efa>] kthread+0xea/0xf0
      [13654.481060]        [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
      [13654.481060] other info that might help us debug this:
      
      [13654.481060]  Possible unsafe locking scenario:
      
      [13654.481060]        CPU0                    CPU1
      [13654.481060]        ----                    ----
      [13654.481060]   lock(&(&fs_info->ordered_root_lock)->rlock);
      [13654.481060]				 lock(&(&root->ordered_extent_lock)->rlock);
      [13654.481060]				 lock(&(&fs_info->ordered_root_lock)->rlock);
      [13654.481060]   lock(&(&root->ordered_extent_lock)->rlock);
      [13654.481060]
       *** DEADLOCK ***
      [...]
      
      ======================================================
      
      btrfs_destroy_all_ordered_extents()
      gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock,
      while btrfs_[add,remove]_ordered_extent()
      acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock.
      
      This patch fixes the above problem.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      2a85d9ca
    • F
      Btrfs: faster/more efficient insertion of file extent items · d5f37527
      Filipe David Borba Manana 提交于
      This is an extension to my previous commit titled:
      
        "Btrfs: faster file extent item replace operations"
        (hash 1acae57b)
      
      Instead of inserting the new file extent item if we deleted existing
      file extent items covering our target file range, also allow to insert
      the new file extent item if we didn't find any existing items to delete
      and replace_extent != 0, since in this case our caller would do another
      tree search to insert the new file extent item anyway, therefore just
      combine the two tree searches into a single one, saving cpu time, reducing
      lock contention and reducing btree node/leaf COW operations.
      
      This covers the case where applications keep doing tail append writes to
      files, which for example is the case of Apache CouchDB (its database and
      view index files are always open with O_APPEND).
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      d5f37527