1. 14 Mar 2016, 1 commit
  2. 12 Mar 2016, 1 commit
  3. 18 Feb 2016, 1 commit
  4. 16 Jan 2016, 1 commit
  5. 22 Nov 2014, 1 commit
    • Btrfs: do not move em to modified list when unpinning · a2804695
      Committed by Josef Bacik
      We use the modified list to keep track of which extents have been modified so we
      know which ones are candidates for logging at fsync() time.  Newly modified
      extents are added to the list at modification time, around the same time the
      ordered extent is created.  We do this so that we don't have to wait for ordered
      extents to complete before we know what we need to log.  The problem is when
something like this happens:
      
      log extent 0-4k on inode 1
      copy csum for 0-4k from ordered extent into log
      sync log
      commit transaction
      log some other extent on inode 1
      ordered extent for 0-4k completes and adds itself onto modified list again
      log changed extents
      see ordered extent for 0-4k has already been logged
      	at this point we assume the csum has been copied
      sync log
      crash
      
On replay we will see the extent 0-4k in the log and drop the original
0-4k extent, which is the same one that we are replaying; that also drops
the csum, and then we won't find the csum in the log for that bytenr.
This of course causes us to report errors about missing csums for certain
ranges of our inode.  So remove the modified list manipulation in
unpin_extent_cache: any modified extents should have been added well
before now, and we don't want them re-logged.  This fixes the test case
with which I could reliably reproduce the problem.  Thanks,
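
      A condensed sketch of the fix, assuming the real extent_map API
      (lookup_extent_mapping(), EXTENT_FLAG_PINNED, free_extent_map()) with
      error handling trimmed; unpinning now only clears the pinned state:

          /* Sketch: unpin without re-adding to the modified list. */
          int unpin_extent_cache(struct extent_map_tree *tree, u64 start, u64 len)
          {
                  struct extent_map *em;

                  write_lock(&tree->lock);
                  em = lookup_extent_mapping(tree, start, len);
                  if (em) {
                          clear_bit(EXTENT_FLAG_PINNED, &em->flags);
                          /* The buggy version also did
                           *   list_move(&em->list, &tree->modified_extents);
                           * letting a completing ordered extent re-queue an
                           * already-logged extent without its csum. */
                          free_extent_map(em);    /* drop lookup reference */
                  }
                  write_unlock(&tree->lock);
                  return 0;
          }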
      
      cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
  6. 20 Jun 2014, 1 commit
    • Btrfs: fix NULL pointer crash when running balance and scrub concurrently · 298a8f9c
      Committed by Wang Shilong
      While running balance, scrub, fsstress concurrently we hit the
      following kernel crash:
      
      [56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
      [56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
      [56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
      [56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
      [56561.524325] Oops: 0000 [#1] SMP
      [....]
      [56561.527237] Call Trace:
      [56561.527309]  [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
      [56561.527392]  [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
      [56561.527476]  [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
      [56561.527561]  [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
      [56561.527639]  [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
      [56561.527712]  [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
      [56561.527788]  [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
      [56561.527870]  [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
      [56561.527941]  [<ffffffff815707f7>] tracesys+0xdd/0xe2
      [...]
      [56561.528304] RIP  [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
      [56561.528395]  RSP <ffff88004c0f5be8>
      [56561.528454] CR2: 0000000000000078
      
This is because in btrfs_relocate_chunk() we will free @bdev directly,
while scrub may still hold the extent mapping and may access freed memory.
      
Fix this problem by moving the freeing of @bdev into free_extent_map(),
which is reference counted.
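
      A hedged sketch of the refcount-based release (free_extent_map() and the
      refs field are real; free_bdev() is a hypothetical stand-in for the bdev
      release this patch moves here, and the real code frees into a kmem cache):

          /* Sketch: only the last reference put frees, so a scrub that
           * still holds a reference keeps @bdev alive as well. */
          void free_extent_map(struct extent_map *em)
          {
                  if (!em)
                          return;
                  if (atomic_dec_and_test(&em->refs)) {
                          free_bdev(em->bdev);    /* hypothetical helper */
                          kfree(em);              /* real code: kmem cache */
                  }
          }
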
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
  7. 11 Mar 2014, 2 commits
    • Btrfs: more efficient btrfs_drop_extent_cache · 176840b3
      Committed by Filipe Manana
While dropping extent map structures from the extent cache that cover our
      target range, we would remove each extent map structure from the red black
      tree and then add either 1 or 2 new extent map structures if the former
      extent map covered sections outside our target range.
      
      This change simply attempts to replace the existing extent map structure
      with a new one that covers the subsection we're not interested in, instead
of doing a red-black tree removal followed by an insertion.
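
      A minimal sketch of the idea using the kernel rbtree API
      (rb_replace_node() is real; the wrapper is illustrative):

          #include <linux/rbtree.h>

          /* Sketch: swap @old for @new in place instead of rb_erase()
           * followed by a fresh insertion walk. */
          static void replace_extent_mapping(struct rb_root *root,
                                             struct extent_map *old,
                                             struct extent_map *new)
          {
                  rb_replace_node(&old->rb_node, &new->rb_node, root);
                  RB_CLEAR_NODE(&old->rb_node);   /* old is out of the tree */
          }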
      
      The number of elements in an inode's extent map tree can get very high for large
      files under random writes. For example, while running the following test:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G \
              --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
              --max-requests=500000 --file-rw-ratio=2 [prepare|run]
      
I captured the following histogram of the number of extent_map items
in the red black tree while the test was running:
      
          Count: 122462
          Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
          Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
             1.000 -    5.231:   452 |
             5.231 -  187.392:    87 |
           187.392 -  585.911:   206 |
           585.911 - 1827.438:   623 |
          1827.438 - 5695.245:  1962 #
          5695.245 - 17744.861:  6204 ####
         17744.861 - 55283.764: 21115 ############
         55283.764 - 172231.000: 91813 #####################################################
      
      Benchmark:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
              --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
              --file-io-mode=sync --file-fsync-freq=0 [prepare|run]
      
      Before this change: 122.1Mb/sec
      After this change:  125.07Mb/sec
      (averages of 5 test runs)
      
Test machine: quad core Intel i5-3570K, 32GB of RAM, SSD
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: remove unneeded field / smaller extent_map structure · cbc0e928
      Committed by Filipe Manana
      We don't need to have an unsigned int field in the extent_map struct
      to tell us whether the extent map is in the inode's extent_map tree or
      not. We can use the rb_node struct field and the RB_CLEAR_NODE and
      RB_EMPTY_NODE macros to achieve the same task.
      
This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
64-bit system).
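
      A small sketch of the trick (RB_CLEAR_NODE() and RB_EMPTY_NODE() are the
      real rbtree macros; the helper mirrors the one this patch introduces):

          #include <linux/rbtree.h>

          /* Sketch: a cleared rb_node doubles as a "not in any tree"
           * marker, so no separate in_tree field is needed. */
          static inline int extent_map_in_tree(const struct extent_map *em)
          {
                  return !RB_EMPTY_NODE(&em->rb_node);
          }

          /* On allocation and again on removal from the tree:
           *   RB_CLEAR_NODE(&em->rb_node);
           */
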
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
  8. 29 Jan 2014, 2 commits
    • Btrfs: fix extent_map block_len after merging · d527afe1
      Committed by Filipe David Borba Manana
      When merging an extent_map with its right neighbor, increment
      its block_len with the neighbor's block_len.
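
      Sketched in a few lines (field names are real; the surrounding merge
      logic is elided):

          /* Sketch of the fixed right-merge path: @em absorbs its
           * right neighbor @merge. */
          static void merge_right(struct extent_map *em, struct extent_map *merge)
          {
                  em->len += merge->len;
                  em->block_len += merge->block_len;  /* previously missed */
                  free_extent_map(merge);
          }
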
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: faster and more efficient extent map insertion · 32193c14
      Committed by Filipe David Borba Manana
      Before this change, adding an extent map to the extent map tree of an
inode required two tree navigations:
      
      1) doing a tree navigation to search for an existing extent map starting
         at the same offset or an extent map that overlaps the extent map we
         want to insert;
      
2) another tree navigation to add the extent map to the tree (if the
   former tree search didn't find anything).
      
This change merges these two steps into a single one.
While running the first few btrfs xfstests I noticed these trees easily
held a few hundred elements, and with the following sysbench test they
often reached over 1100 elements.
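
      A hedged sketch of a single-walk search-or-insert on the kernel rbtree
      (rb_link_node()/rb_insert_color() are the real API; the comparison and
      overlap handling are condensed):

          #include <linux/rbtree.h>

          /* Sketch: one walk finds either an overlapping entry or the
           * link point for the new one - no second navigation. */
          static struct extent_map *tree_insert(struct rb_root *root,
                                                struct extent_map *em)
          {
                  struct rb_node **p = &root->rb_node, *parent = NULL;
                  struct extent_map *entry;

                  while (*p) {
                          parent = *p;
                          entry = rb_entry(parent, struct extent_map, rb_node);
                          if (em->start < entry->start)
                                  p = &(*p)->rb_left;
                          else if (em->start >= entry->start + entry->len)
                                  p = &(*p)->rb_right;
                          else
                                  return entry;   /* overlap: caller decides */
                  }
                  rb_link_node(&em->rb_node, parent, p);
                  rb_insert_color(&em->rb_node, root);
                  return NULL;                    /* inserted */
          }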
      
      Test:
      
        sysbench --test=fileio --file-num=32 --file-total-size=10G \
          --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
          --max-requests=1000000 --file-io-mode=sync [prepare|run]
      
      (fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench
      prepare phase)
      
      Before this patch:
      
      run 1 - 41.894Mb/sec
      run 2 - 40.527Mb/sec
      run 3 - 40.922Mb/sec
      run 4 - 49.433Mb/sec
      run 5 - 40.959Mb/sec
      
      average - 42.75Mb/sec
      
      After this patch:
      
      run 1 - 48.036Mb/sec
      run 2 - 50.21Mb/sec
      run 3 - 50.929Mb/sec
      run 4 - 46.881Mb/sec
      run 5 - 53.192Mb/sec
      
      average - 49.85Mb/sec
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
  9. 07 May 2013, 2 commits
    • btrfs: make static code static & remove dead code · 48a3b636
      Committed by Eric Sandeen
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix bad extent logging · 09a2a8f9
      Committed by Josef Bacik
A user sent me a btrfs-image of a file system that was panicking on mount
during log recovery.  I had originally thought these problems were from a
bug in the free space cache code, but that was just a symptom of the
problem.  The problem is if your application does something like this:
      
      [prealloc][prealloc][prealloc]
      
      the internal extent maps will merge those all together into one extent map, even
      though on disk they are 3 separate extents.  So if you go to write into one of
      these ranges the extent map will be right since we use the physical extent when
      doing the write, but when we log the extents they will use the wrong sizes for
      the remainder prealloc space.  If this doesn't happen to trip up the free space
      cache (which it won't in a lot of cases) then you will get bogus entries in your
      extent tree which will screw stuff up later.  The data and such will still work,
      but everything else is broken.  This patch fixes this by not allowing extents
      that are on the modified list to be merged.  This has the side effect that we
      are no longer adding everything to the modified list all the time, which means
      we now have to call btrfs_drop_extents every time we log an extent into the
tree.  So this allows me to drop all the special-case code I was using to get
around calling btrfs_drop_extents.  With this patch the testcase I've created no
      longer creates a bogus file system after replaying the log.  Thanks,
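
      A hedged sketch of the merge guard (em->list is the real modified-list
      hook; the remaining mergeability checks are elided):

          /* Sketch: never merge extent maps still on the modified list,
           * so each prealloc extent logs with its real on-disk size. */
          static int mergeable(struct extent_map *prev, struct extent_map *next)
          {
                  if (!list_empty(&prev->list) || !list_empty(&next->list))
                          return 0;
                  /* ... adjacency and flag checks elided ... */
                  return 1;
          }
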
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  10. 02 Mar 2013, 1 commit
  11. 06 Feb 2013, 1 commit
    • Btrfs: do not merge logged extents if we've removed them from the tree · 222c81dc
      Committed by Josef Bacik
You can run into a problem where, if somebody is fsyncing and writing out
the existing extents, you will have removed the extent map from the em tree,
but it's still valid for the current fsync, so we go ahead and write it.  The
problem is we unconditionally try to merge it back into the em tree, but if
we've removed it from the em tree that will cause use-after-free problems.
Fix this to only merge if we are still a part of the tree.  Thanks,
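
      A hedged sketch of the guard (the 2013 code tested an in_tree field;
      written here with the rbtree macro for brevity, try_merge_map() is real):

          /* Sketch: only merge an extent map back if it is still linked
           * into the tree; touching a removed node is a use-after-free. */
          static void maybe_merge(struct extent_map_tree *tree,
                                  struct extent_map *em)
          {
                  if (RB_EMPTY_NODE(&em->rb_node))
                          return;
                  try_merge_map(tree, em);
          }
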
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  12. 25 Jan 2013, 1 commit
    • Btrfs: do not allow logged extents to be merged or removed · 201a9038
      Committed by Josef Bacik
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list, which would keep us from
logging the extent or freeing our ref on it.  So we need to make sure not to
clear LOGGING until after the extent is logged; then we can merge it with
adjacent extents.  Thanks,
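
      A condensed sketch (EXTENT_FLAG_LOGGING is the real flag bit; the other
      mergeability checks are elided):

          /* Sketch: extents being logged are skipped by merging until
           * fsync clears the bit. */
          static int mergeable_map(struct extent_map *em)
          {
                  if (test_bit(EXTENT_FLAG_LOGGING, &em->flags))
                          return 0;       /* fsync still owns this map */
                  /* ... adjacency checks elided ... */
                  return 1;
          }

          /* Once the extent is safely in the log:
           *   clear_bit(EXTENT_FLAG_LOGGING, &em->flags);
           *   try_merge_map(tree, em);
           */
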
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  13. 17 Dec 2012, 2 commits
  14. 30 Oct 2012, 1 commit
  15. 04 Oct 2012, 1 commit
    • Btrfs: do not hold the write_lock on the extent tree while logging · ff44c6e3
      Committed by Josef Bacik
Dave Sterba pointed out a sleeping-while-atomic bug while doing fsync.  This
is because I'm an idiot and didn't realize that rwlocks were spin locks, so
we've been holding this thing while doing allocations and such, which is not
good.  This patch fixes this by dropping the write lock before we do
anything heavy and re-acquiring it when done.  We also need to take a
ref on the ems in case their corresponding pages are evicted, and mark them
as being logged so that releasepage does not free them or drop them
from our local list.  Thanks,
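
      A hedged sketch of the pattern (rwlock_t spins, so nothing that can
      sleep may run under it; names follow the real extent_map API, the
      logging work itself is elided):

          /* Sketch: pin the map with a ref and a flag, drop the lock for
           * the heavy work, then take the lock again. */
          static void log_one_map(struct extent_map_tree *tree,
                                  struct extent_map *em)
          {
                  atomic_inc(&em->refs);                    /* keep em alive */
                  set_bit(EXTENT_FLAG_LOGGING, &em->flags); /* releasepage: keep out */
                  write_unlock(&tree->lock);

                  /* heavy, possibly sleeping work: allocations, csum copies */

                  write_lock(&tree->lock);
                  free_extent_map(em);                      /* drop our ref */
          }
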
Reported-by: Dave Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  16. 02 Oct 2012, 3 commits
    • btrfs: polish names of kmem caches · 837e1972
      Committed by David Sterba
Use case:

  watch 'grep btrfs < /proc/slabinfo'

makes it easy to watch all caches in one go.
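
      A representative call after the rename (kmem_cache_create() is the
      standard API; flags shown are typical for these caches):

          #include <linux/slab.h>

          static struct kmem_cache *extent_map_cache;

          /* Sketch: one consistent "btrfs_" prefix makes the grep above
           * catch every btrfs cache. */
          static int __init extent_map_init(void)
          {
                  extent_map_cache = kmem_cache_create("btrfs_extent_map",
                                          sizeof(struct extent_map), 0,
                                          SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
                                          NULL);
                  return extent_map_cache ? 0 : -ENOMEM;
          }
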
Signed-off-by: David Sterba <dsterba@suse.cz>
    • Btrfs: improve fsync by filtering extents that we want · 4e2f84e6
      Committed by Liu Bo
      This is based on Josef's "Btrfs: turbo charge fsync".
      
Josef's patch above performs very well in the random sync write test,
because we won't have too many extents to merge.

However, it does not perform well on this test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync
      
The reason is that when we do sequential sync writes, we keep merging the
current extent with the previous one, so we get ever-larger accumulated
extents to log:

A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...

So we have to flush more and more checksums into the log tree, which is
the bottleneck according to my tests.

But we can avoid this by telling fsync the real extents that need
to be logged.
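
      A hedged sketch of the idea: remember the modified subrange of each
      merged map so fsync logs only that part (mod_start/mod_len follow the
      fields this patch adds; the struct and update shown are simplified):

          /* Sketch: a merged map remembers what was actually written,
           * so fsync need not re-log csums for the whole merged range. */
          struct em_sketch {
                  u64 start, len;          /* whole merged extent */
                  u64 mod_start, mod_len;  /* subrange written this time */
          };

          static void note_write(struct em_sketch *em, u64 wstart, u64 wlen)
          {
                  em->mod_start = wstart;
                  em->mod_len = wlen;
          }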
      
      With this, I did the above dd sync write test (size=50m),
      
               w/o (orig)   w/ (josef's)   w/ (this)
      SATA      104KB/s       109KB/s       121KB/s
      ramdisk   1.5MB/s       1.5MB/s       10.7MB/s (613%)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: turbo charge fsync · 5dc562c5
      Committed by Josef Bacik
At least for the VM workload.  Currently on fsync we will:
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worse yet you may have
only modified a few extents, not the entire thing.  This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them.  We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already.  Here are some
numbers from a 50 meg fio job that does random writes and fsync()s after
every write:
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
So around 2-6 times faster depending on your hardware.  There are a few
corner cases: for example, if you truncate at all, we have to do it the old
way, since there is no way to be sure what is in the log is ok.  This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get.  All this work is in RAM of course, so if your
inode gets evicted from cache and you read it in and fsync it, we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
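
      A hedged sketch of the tracking (generation, the per-map list hook and
      the modified_extents list are real names from this patch; locking is
      omitted):

          /* Sketch: each modification stamps the map with the current
           * transid and queues it; fsync then logs only the maps from
           * its own transaction. */
          static void mark_modified(struct extent_map_tree *tree,
                                    struct extent_map *em, u64 transid)
          {
                  em->generation = transid;
                  if (list_empty(&em->list))
                          list_add_tail(&em->list, &tree->modified_extents);
          }
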
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  17. 02 Aug 2011, 3 commits
  18. 02 May 2011, 2 commits
  19. 31 Mar 2011, 1 commit
  20. 15 Feb 2011, 1 commit
  21. 22 Dec 2010, 1 commit
  22. 30 Oct 2010, 1 commit
  23. 30 Mar 2010, 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Committed by Tejun Heo
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
The percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability.  As this conversion
needs to touch a large number of source files, the following script was
used as the basis of the conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
The script does the following (a small illustration follows the list).
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
* When the script inserts a new include, it looks at the include
  blocks and tries to place the new include so that its order conforms
  to its surroundings.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree - or at the end if there
  doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
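
      The kind of edit the script makes, illustrated (the file contents are
      hypothetical):

          /* before: relied on percpu.h pulling in slab.h implicitly */
          #include <linux/percpu.h>

          /* after: state the real dependencies explicitly */
          #include <linux/gfp.h>    /* GFP_KERNEL */
          #include <linux/slab.h>   /* kmalloc(), kfree() */
          #include <linux/percpu.h>

          static void *make_buf(void)
          {
                  return kmalloc(64, GFP_KERNEL);  /* needs slab.h + gfp.h */
          }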
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition, and for others adding it to an
   implementation .h or embedding .c file was more appropriate.  This
   step added inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of the
specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  24. 09 Mar 2010, 1 commit
  25. 29 Jan 2010, 1 commit
  26. 04 Dec 2009, 1 commit
  27. 12 Nov 2009, 1 commit
  28. 19 Sep 2009, 1 commit
  29. 12 Sep 2009, 2 commits
    • Btrfs: Fix extent replacement race · a1ed835e
      Committed by Chris Mason
Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones.  There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in RAM in the middle of a given write replacing them.
      
      Even though both the readpage and the write had their respective bytes
      in the file locked, the extent readpage inserts may cover more bytes than
      it had locked down.
      
      This commit closes the race by keeping the new extent pinned in the extent
      map tree until after the on-disk btree is properly setup with the new
      extent pointers.
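
      A hedged sketch of the pinning (EXTENT_FLAG_PINNED and
      add_extent_mapping() are real; the write path is heavily condensed):

          /* Sketch: keep the new mapping pinned until the on-disk btree
           * holds the new pointers, so a racing readpage cannot cache the
           * stale ones over it. */
          static void insert_pinned(struct extent_map_tree *tree,
                                    struct extent_map *em)
          {
                  set_bit(EXTENT_FLAG_PINNED, &em->flags);
                  write_lock(&tree->lock);
                  add_extent_mapping(tree, em);
                  write_unlock(&tree->lock);
          }

          /* After the btree is updated:
           *   unpin_extent_cache(tree, em->start, em->len);
           */
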
Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: switch extent_map to a rw lock · 890871be
      Committed by Chris Mason
      There are two main users of the extent_map tree.  The
      first is regular file inodes, where it is evenly spread
      between readers and writers.
      
The second is the chunk allocation tree, which maps blocks from
logical addresses to physical ones, and it is 99.99% reads.
      
      The mapping tree is a point of lock contention during heavy IO
      workloads, so this commit switches things to a rw lock.
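
      A minimal sketch of the switch (rwlock_t and the read/write lock
      primitives are the standard kernel API; the tree is condensed):

          #include <linux/spinlock.h>   /* rwlock_t */
          #include <linux/rbtree.h>

          /* Sketch: the tree's spinlock becomes an rwlock so the
           * read-mostly chunk lookups no longer serialize each other. */
          struct em_tree_sketch {
                  struct rb_root map;
                  rwlock_t lock;          /* was spinlock_t */
          };

          static struct rb_node *first_mapping(struct em_tree_sketch *tree)
          {
                  struct rb_node *n;

                  read_lock(&tree->lock); /* readers run concurrently */
                  n = rb_first(&tree->map);
                  read_unlock(&tree->lock);
                  return n;
          }

          /* insert/remove paths take write_lock(&tree->lock) instead. */
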
Signed-off-by: Chris Mason <chris.mason@oracle.com>
  30. 25 Apr 2009, 1 commit