1. 21 11月, 2014 1 次提交
    • G
      btrfs: fix dead lock while running replace and defrag concurrently · 32159242
      Gui Hecheng 提交于
      This can be reproduced by fstests: btrfs/070
      
      The scenario is like the following:
      
      replace worker thread		defrag thread
      ---------------------		-------------
      copy_nocow_pages_worker		btrfs_defrag_file
        copy_nocow_pages_for_inode	    ...
      				  btrfs_writepages
        |A| lock_extent_bits		    extent_write_cache_pages
      				|B|   lock_page
      					__extent_writepage
      		...			  writepage_delalloc
      					    find_lock_delalloc_range
      				|B| 	      lock_extent_bits
        find_or_create_page
          pagecache_get_page
        |A| lock_page
      
      This leads to an ABBA pattern deadlock. To fix it,
      o we just change it to an AABB pattern which means to @unlock_extent_bits()
        before we @lock_page(), and in this way the @extent_read_full_page_nolock()
        is no longer in an locked context, so change it back to @extent_read_full_page()
        to regain protection.
      
      o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
        the extent may not really point at the physical block we want, so we
        have to check it before write.
      Signed-off-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      32159242
  2. 02 10月, 2014 1 次提交
  3. 18 9月, 2014 7 次提交
  4. 24 8月, 2014 1 次提交
    • L
      Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo 提交于
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
      
      Btrfs has a kind of work queued as an ordered way, which means that its
      ordered_func() must be processed in the way of FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case, first when we find free space, we get an uncached block
      group, then we go to read its free space cache inode for free space information,
      so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happens
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readhead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      As if work A has a high priority in wq->ordered_list and there are more ordered
      works queued after it, such as B->ordered_func(), its memory could have been
      freed before normal_work_helper() returns, which means that kernel workqueue
      code worker_thread() still has worker->current_work pointer to be work
      A->normal_work's, ie. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, work C->normal_work
      and work A->normal_work are likely to share the same address(I confirmed this
      with ftrace output, so I'm not just guessing, it's rare though).
      
      When another kthread picks up work C->normal_work to process, and finds our
      kthread is processing it(see find_worker_executing_work()), it'll think
      work C as a collision and skip then, which ends up nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there're other cases that can lead to deadlock, but the real problem
      is that all btrfs workqueue shares one work->func, -- normal_work_helper,
      so this makes each workqueue to have its own helper function, but only a
      wraper pf normal_work_helper.
      
      With this patch, I no long hit the above hang.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9e0af237
  5. 19 8月, 2014 1 次提交
  6. 20 6月, 2014 1 次提交
  7. 10 6月, 2014 2 次提交
  8. 11 4月, 2014 1 次提交
  9. 08 4月, 2014 1 次提交
    • W
      Btrfs: scrub raid56 stripes in the right way · 3b080b25
      Wang Shilong 提交于
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
       # mount /dev/sda8 /mnt
       # btrfs scrub start -BR /mnt
       # echo $? <--unverified errors make return value be 3
      
      This is because we don't setup right mapping between physical
      and logical address for raid56, which makes checksum mismatch.
      But we will find everthing is fine later when rechecking using
      btrfs_map_block().
      
      This patch fixed the problem by settuping right mappings and
      we only verify data stripes' checksums.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3b080b25
  10. 11 3月, 2014 4 次提交
  11. 29 1月, 2014 6 次提交
  12. 25 11月, 2013 1 次提交
  13. 24 11月, 2013 2 次提交
    • K
      block: Abstract out bvec iterator · 4f024f37
      Kent Overstreet 提交于
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>6
      4f024f37
    • K
      block: submit_bio_wait() conversions · 33879d45
      Kent Overstreet 提交于
      It was being open coded in a few places.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Acked-by: NNeilBrown <neilb@suse.de>
      33879d45
  14. 21 11月, 2013 1 次提交
  15. 12 11月, 2013 3 次提交
  16. 21 9月, 2013 1 次提交
    • J
      Btrfs: improve replacing nocow extents · 652f25a2
      Josef Bacik 提交于
      Various people have hit a deadlock when running btrfs/011.  This is because when
      replacing nocow extents we will take the i_mutex to make sure nobody messes with
      the file while we are replacing the extent.  The problem is we are already
      holding a transaction open, which is a locking inversion, so instead we need to
      save these inodes we find and then process them outside of the transaction.
      
      Further we can't just lock the inode and assume we are good to go.  We need to
      lock the extent range and then read back the extent cache for the inode to make
      sure the extent really still points at the physical block we want.  If it
      doesn't we don't have to copy it.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      652f25a2
  17. 01 9月, 2013 5 次提交
  18. 20 7月, 2013 1 次提交
    • S
      Btrfs: fix wrong write offset when replacing a device · 115930cb
      Stefan Behrens 提交于
      Miao Xie reported the following issue:
      
      The filesystem was corrupted after we did a device replace.
      
      Steps to reproduce:
       # mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
       # mount <device0> <mnt>
       # btrfs replace start -rfB 1 <device4> <mnt>
       # umount <mnt>
       # btrfsck <device4>
      
      The reason for the issue is that we changed the write offset by mistake,
      introduced by commit 625f1c8d.
      
      We read the data from the source device at first, and then write the
      data into the corresponding place of the new device. In order to
      implement the "-r" option, the source location is remapped using
      btrfs_map_block(). The read takes place on the mapped location, and
      the write needs to take place on the unmapped location. Currently
      the write is using the mapped location, and this commit changes it
      back by undoing the change to the write address that the aforementioned
      commit added by mistake.
      Reported-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org> # 3.10+
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      115930cb