1. 24 Aug, 2014 1 commit
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Committed by Liu Bo
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs has migrated to the kernel workqueue, which introduced this hang.
      
      Btrfs has a kind of work that is queued in an ordered way, which means that
      its ordered_func() must be processed in FIFO order, so it usually looks like:
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case. When we look for free space and get an uncached
      block group, we go to read its free space cache inode for free space
      information, so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happen
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readahead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      If work A is at the head of wq->ordered_list and more ordered works are
      queued after it, such as B->ordered_func(), its memory could have been
      freed before normal_work_helper() returns, which means that the kernel
      workqueue code in worker_thread() still has worker->current_work pointing
      to A->normal_work's address, i.e. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, so C->normal_work
      and A->normal_work are likely to share the same address (I confirmed this
      with ftrace output, so I'm not just guessing, though it is rare).
      
      When another kthread picks up work C->normal_work to process and finds that
      our kthread is processing it (see find_worker_executing_work()), it will
      treat work C as a collision and skip it, so nobody ends up processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real
      problem is that all btrfs workqueues share one work->func, normal_work_helper,
      so this patch makes each workqueue have its own helper function, which is
      only a wrapper of normal_work_helper.
      
      With this patch, I no longer hit the above hang.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 19 Aug, 2014 1 commit
  3. 20 Jun, 2014 1 commit
  4. 10 Jun, 2014 2 commits
  5. 11 Apr, 2014 1 commit
  6. 08 Apr, 2014 1 commit
    • Btrfs: scrub raid56 stripes in the right way · 3b080b25
      Committed by Wang Shilong
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
       # mount /dev/sda8 /mnt
       # btrfs scrub start -BR /mnt
       # echo $? <--unverified errors make return value be 3
      
      This is because we don't set up the right mapping between physical
      and logical addresses for raid56, which causes checksum mismatches.
      But we find that everything is fine later when rechecking using
      btrfs_map_block().
      
      This patch fixes the problem by setting up the right mappings, and
      we only verify the data stripes' checksums.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 11 Mar, 2014 4 commits
  8. 29 Jan, 2014 6 commits
  9. 25 Nov, 2013 1 commit
  10. 24 Nov, 2013 2 commits
    • block: Abstract out bvec iterator · 4f024f37
      Committed by Kent Overstreet
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
    • block: submit_bio_wait() conversions · 33879d45
      Committed by Kent Overstreet
      It was being open coded in a few places.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Acked-by: NeilBrown <neilb@suse.de>
  11. 21 Nov, 2013 1 commit
  12. 12 Nov, 2013 3 commits
  13. 21 Sep, 2013 1 commit
    • Btrfs: improve replacing nocow extents · 652f25a2
      Committed by Josef Bacik
      Various people have hit a deadlock when running btrfs/011.  This is because when
      replacing nocow extents we will take the i_mutex to make sure nobody messes with
      the file while we are replacing the extent.  The problem is we are already
      holding a transaction open, which is a locking inversion, so instead we need to
      save these inodes we find and then process them outside of the transaction.
      
      Further we can't just lock the inode and assume we are good to go.  We need to
      lock the extent range and then read back the extent cache for the inode to make
      sure the extent really still points at the physical block we want.  If it
      doesn't we don't have to copy it.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
  14. 01 Sep, 2013 5 commits
  15. 20 Jul, 2013 1 commit
    • Btrfs: fix wrong write offset when replacing a device · 115930cb
      Committed by Stefan Behrens
      Miao Xie reported the following issue:
      
      The filesystem was corrupted after we did a device replace.
      
      Steps to reproduce:
       # mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
       # mount <device0> <mnt>
       # btrfs replace start -rfB 1 <device4> <mnt>
       # umount <mnt>
       # btrfsck <device4>
      
      The reason for the issue is that we changed the write offset by mistake,
      introduced by commit 625f1c8d.
      
      We read the data from the source device at first, and then write the
      data into the corresponding place of the new device. In order to
      implement the "-r" option, the source location is remapped using
      btrfs_map_block(). The read takes place on the mapped location, and
      the write needs to take place on the unmapped location. Currently
      the write is using the mapped location, and this commit changes it
      back by undoing the change to the write address that the aforementioned
      commit added by mistake.
      Reported-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org> # 3.10+
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  16. 02 Jul, 2013 4 commits
    • Btrfs: fix several potential problems in copy_nocow_pages_for_inode · edd1400b
      Committed by Miao Xie
      - It makes no sense to deal with an inode in a dead tree.
      - fix the race between dio and page copy by waiting for the dio completion
      - avoid the page copy vs truncate/punch hole
      - check if the page is in the page cache or not
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: cleanup the code of copy_nocow_pages_for_inode() · 826aa0a8
      Committed by Miao Xie
      - It makes no sense to continue after an error has happened; with this
        patch we just return.
      - remove some checks in copy_nocow_pages_for_inode(), such as the page check
        after the write and the inode check at the end of the function, because we
        are sure they exist.
      - remove the unnecessary goto in the return value check of the write
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix oops when recovering the file data by scrub function · 26b25891
      Committed by Miao Xie
      We get an oops while running the btrfs replace start test:
      ------------[ cut here ]------------
      kernel BUG at mm/filemap.c:608!
      [SNIP]
      Call Trace:
        [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs]
        [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs]
        [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs]
        [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0
        [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
        [<ffffffff8106f2f0>] kthread+0xc0/0xd0
        [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
        [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0
        [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
      [SNIP]
       RIP  [<ffffffff8111f4c5>] unlock_page+0x35/0x40
        RSP <ffff88010316bb98>
       ---[ end trace 421e79ad0dd72c7d ]---
      
      This is because we forgot to lock the page again after we read data into
      the page. Fix it.
      Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: remove btrfs_sector_sum structure · f51a4a18
      Committed by Miao Xie
      Using the structure btrfs_sector_sum to keep the checksum value is
      unnecessary, because the extents that btrfs_sector_sum points to are
      continuous, we can find out the expected checksums by btrfs_ordered_sum's
      bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
      removing bytenr, there is only one member in the structure, so it makes
      no sense to keep the structure, just remove it, and use a u32 array to
      store the checksum value.
      
      With this change, we no longer use the while loop to get the checksums one
      by one. We can now get several checksum values at a time, which improved
      the performance by ~74% on my SSD (31MB/s -> 54MB/s).
      
      test command:
       # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  17. 01 Jul, 2013 1 commit
  18. 18 May, 2013 1 commit
    • Btrfs: use a btrfs bioset instead of abusing bio internals · 9be3395b
      Committed by Chris Mason
      Btrfs has been pointer tagging bi_private and using bi_bdev
      to store the stripe index and mirror number of failed IOs.
      
      As bios bubble back up through the call chain, we use these
      to decide if and how to retry our IOs.  They are also used
      to count IO failures on a per device basis.
      
      A recently added bio tracepoint led to crashes because
      we were abusing bi_bdev.
      
      This commit adds a btrfs bioset, and creates explicit fields
      for the mirror number and stripe index.  The plan is to
      extend this structure for all of the fields currently in
      struct btrfs_bio, which will mean one less kmalloc in
      our IO path.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Reported-by: Tejun Heo <tj@kernel.org>
  19. 07 May, 2013 3 commits
    • Btrfs: improve the loop of scrub_stripe · 625f1c8d
      Committed by Liu Bo
      1) Right now scrub_stripe() is looping in some unnecessary cases:
      * when the found extent item's objectid is outside the dev extent's range
        but we haven't finished scanning the whole range within the dev extent
      * when all the items have been processed but we haven't finished scanning
        the whole range within the dev extent
      
      In both cases, we can just finish the loop to save costs.
      
      2) Besides, when the found extent item's length is larger than the stripe
      len(64k), we don't have to release the path and search again as it'll get at the
      same key used in the last loop, we can instead increase the logical cursor in
      place till all space of the extent is scanned.
      
      3) And we use 0 as the key's offset to search btree, then get to previous item
      to find a smaller item, and again have to move to the next one to get the right
      item.  Setting offset=-1 and previous_item() is the correct way.
      
      4) As we won't find any checksum at offset unless this 'offset' is in a data
      extent, we can just find checksum when we're really going to scrub an extent.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • btrfs: make static code static & remove dead code · 48a3b636
      Committed by Eric Sandeen
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: add a incompatible format change for smaller metadata extent refs · 3173a18f
      Committed by Josef Bacik
      We currently store the first key of the tree block inside the reference for the
      tree block in the extent tree.  This takes up quite a bit of space.  Make a new
      key type for metadata which holds the level as the offset and completely removes
      storing the btrfs_tree_block_info inside the extent ref.  This reduces the size
      from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
      this results in a 30-35% decrease in the size of our extent tree, which means we
      COW less and can keep more of the extent tree in memory which makes our heavy
      metadata operations go much faster.  This is not an automatic format change, you
      must enable it at mkfs time or with btrfstune.  This patch deals with having
      metadata stored as either the old format or the new format so it is easy to
      convert.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>