1. 04 3月, 2015 2 次提交
  2. 15 2月, 2015 1 次提交
    • F
      Btrfs: scrub, fix sleep in atomic context · f55985f4
      Filipe Manana 提交于
      My previous patch "Btrfs: fix scrub race leading to use-after-free"
      introduced the possibility to sleep in an atomic context, which happens
      when the scrub_lock mutex is held at the time scrub_pending_bio_dec()
      is called - this function can be called under an atomic context.
      Chris ran into this in a debug kernel which gave the following trace:
      
      [ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621
      [ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress
      [ 1928.981324] INFO: lockdep is turned off.
      [ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: G        W     3.19.0-rc7-mason+ #41
      [ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 05/10/2012
      [ 1929.022207]  ffffffff81a22cf8 ffff881076e03b78 ffffffff816b8dd9 ffff881076e03b78
      [ 1929.037267]  ffff880d8e828710 ffff881076e03ba8 ffffffff810856c4 ffff881076e03bc8
      [ 1929.052315]  0000000000000000 000000000000026d ffffffff81a22cf8 ffff881076e03bd8
      [ 1929.067381] Call Trace:
      [ 1929.072344]  <IRQ>  [<ffffffff816b8dd9>] dump_stack+0x4f/0x6e
      [ 1929.083968]  [<ffffffff810856c4>] ___might_sleep+0x174/0x230
      [ 1929.095352]  [<ffffffff810857d2>] __might_sleep+0x52/0x90
      [ 1929.106223]  [<ffffffff816bb68f>] mutex_lock_nested+0x2f/0x3b0
      [ 1929.117951]  [<ffffffff810ab37d>] ? trace_hardirqs_on+0xd/0x10
      [ 1929.129708]  [<ffffffffa05dc838>] scrub_pending_bio_dec+0x38/0x70 [btrfs]
      [ 1929.143370]  [<ffffffffa05dd0e0>] scrub_parity_bio_endio+0x50/0x70 [btrfs]
      [ 1929.157191]  [<ffffffff812fa603>] bio_endio+0x53/0xa0
      [ 1929.167382]  [<ffffffffa05f96bc>] rbio_orig_end_io+0x7c/0xa0 [btrfs]
      [ 1929.180161]  [<ffffffffa05f97ba>] raid_write_parity_end_io+0x5a/0x80 [btrfs]
      [ 1929.194318]  [<ffffffff812fa603>] bio_endio+0x53/0xa0
      [ 1929.204496]  [<ffffffff8130401b>] blk_update_request+0x1eb/0x450
      [ 1929.216569]  [<ffffffff81096e58>] ? trigger_load_balance+0x78/0x500
      [ 1929.229176]  [<ffffffff8144c74d>] scsi_end_request+0x3d/0x1f0
      [ 1929.240740]  [<ffffffff8144ccac>] scsi_io_completion+0xac/0x5b0
      [ 1929.252654]  [<ffffffff81441c50>] scsi_finish_command+0xf0/0x150
      [ 1929.264725]  [<ffffffff8144d317>] scsi_softirq_done+0x147/0x170
      [ 1929.276635]  [<ffffffff8130ace6>] blk_done_softirq+0x86/0xa0
      [ 1929.288014]  [<ffffffff8105d92e>] __do_softirq+0xde/0x600
      [ 1929.298885]  [<ffffffff8105df6d>] irq_exit+0xbd/0xd0
      (...)
      
      Fix this by using a reference count on the scrub context structure
      instead of locking the scrub_lock mutex.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f55985f4
  3. 03 2月, 2015 1 次提交
    • F
      Btrfs: fix scrub race leading to use-after-free · de554a4f
      Filipe Manana 提交于
      While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
      the following trace:
      
      [68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
      [68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
      [68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
      [68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
      [68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
      [68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
      [68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
      [68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
      [68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
      [68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
      [68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
      [68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
      [68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
      [68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
      [68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
      [68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
      [68127.807663] Stack:
      [68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
      [68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
      [68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
      [68127.807663] Call Trace:
      [68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
      [68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
      [68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
      [68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
      [68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
      [68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
      [68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
      [68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
      [68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
      [68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
      [68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
      [68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
      [68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
      [68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
      [68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
      [68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
      [68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
      [68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
      [68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
      [68127.807663]  RSP <ffff8803f097fbb8>
      [68127.807663] CR2: ffff8803f8947a50
      [68127.807663] ---[ end trace d7045aac00a66cd8 ]---
      
      This is due to a race that can happen in a very tiny time window and is
      illustrated by the following sequence diagram:
      
               CPU 1                                                     CPU 2
      
                                                                      btrfs_scrub_dev()
      scrub_bio_end_io_worker()
         scrub_pending_bio_dec()
             atomic_dec(&sctx->bios_in_flight)
                                                                         wait sctx->bios_in_flight == 0
                                                                         wait sctx->workers_pending == 0
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         (...)
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         scrub_free_ctx(sctx)
                                                                            kfree(sctx)
             wake_up(&sctx->list_wait)
                __wake_up()
                    spin_lock_irqsave(&sctx->list_wait->lock, flags)
      
      Another variation of this scenario that results in the same use-after-free
      issue is:
      
               CPU 1                                                     CPU 2
      
                                                                      btrfs_scrub_dev()
                                                                         wait sctx->bios_in_flight == 0
      scrub_bio_end_io_worker()
         scrub_pending_bio_dec()
             __wake_up(&sctx->list_wait)
                spin_lock_irqsave(&sctx->list_wait->lock, flags)
                default_wake_function()
                    wake up task at CPU 2
                                                                         wait sctx->workers_pending == 0
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         (...)
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         scrub_free_ctx(sctx)
                                                                            kfree(sctx)
                spin_unlock_irqrestore(&sctx->list_wait->lock, flags)
      
      Fix this by holding the scrub lock while doing the wakeup.
      
      This isn't a recent regression, the issue as been around since the scrub
      feature was added (2011, commit a2de733c).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      de554a4f
  4. 22 1月, 2015 12 次提交
  5. 15 1月, 2015 2 次提交
  6. 03 1月, 2015 1 次提交
  7. 03 12月, 2014 4 次提交
    • M
      Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 · 4245215d
      Miao Xie 提交于
      The commit c404e0dc (Btrfs: fix use-after-free in the finishing
      procedure of the device replace) fixed a use-after-free problem
      which happened when removing the source device at the end of device
      replace, but at that time, btrfs didn't support device replace
      on raid56, so we didn't fix the problem on the raid56 profile.
      Currently, we implemented device replace for raid56, so we need
      kick that problem out before we enable that function for raid56.
      
      The fix method is very simple, we just increase the bio per-cpu
      counter before we submit a raid56 io, and decrease the counter
      when the raid56 io ends.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      4245215d
    • M
      Btrfs, replace: write raid56 parity into the replace target device · 76035976
      Miao Xie 提交于
      This function reused the code of parity scrub, and we just write
      the right parity or corrected parity into the target device before
      the parity scrub end.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      76035976
    • M
      Btrfs, raid56: support parity scrub on raid56 · 5a6ac9ea
      Miao Xie 提交于
      The implementation is:
      - Read and check all the data with checksum in the same stripe.
        All the data which has checksum is COW data, and we are sure
        that it is not changed though we don't lock the stripe. because
        the space of that data just can be reclaimed after the current
        transction is committed, and then the fs can use it to store the
        other data, but when doing scrub, we hold the current transaction,
        that is that data can not be recovered, it is safe that read and check
        it out of the stripe lock.
      - Lock the stripe
      - Read out all the data without checksum and parity
        The data without checksum and the parity may be changed if we don't
        lock the stripe, so we need read it in the stripe lock context.
      - Check the parity
      - Re-calculate the new parity and write back it if the old parity
        is not right
      - Unlock the stripe
      
      If we can not read out the data or the data we read is corrupted,
      we will try to repair it. If the repair fails. we will mark the
      horizontal sub-stripe(pages on the same horizontal) as corrupted
      sub-stripe, and we will skip the parity check and repair of that
      horizontal sub-stripe.
      
      And in order to skip the horizontal sub-stripe that has no data, we
      introduce a bitmap. If there is some data on the horizontal sub-stripe,
      we will the relative bit to 1, and when we check and repair the
      parity, we will skip those horizontal sub-stripes that the relative
      bits is 0.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      5a6ac9ea
    • M
      Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted · af8e2d1d
      Miao Xie 提交于
      This patch implement the RAID5/6 common data repair function, the
      implementation is similar to the scrub on the other RAID such as
      RAID1, the differentia is that we don't read the data from the
      mirror, we use the data repair function of RAID5/6.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      af8e2d1d
  8. 21 11月, 2014 1 次提交
    • G
      btrfs: fix dead lock while running replace and defrag concurrently · 32159242
      Gui Hecheng 提交于
      This can be reproduced by fstests: btrfs/070
      
      The scenario is like the following:
      
      replace worker thread		defrag thread
      ---------------------		-------------
      copy_nocow_pages_worker		btrfs_defrag_file
        copy_nocow_pages_for_inode	    ...
      				  btrfs_writepages
        |A| lock_extent_bits		    extent_write_cache_pages
      				|B|   lock_page
      					__extent_writepage
      		...			  writepage_delalloc
      					    find_lock_delalloc_range
      				|B| 	      lock_extent_bits
        find_or_create_page
          pagecache_get_page
        |A| lock_page
      
      This leads to an ABBA pattern deadlock. To fix it,
      o we just change it to an AABB pattern which means to @unlock_extent_bits()
        before we @lock_page(), and in this way the @extent_read_full_page_nolock()
        is no longer in an locked context, so change it back to @extent_read_full_page()
        to regain protection.
      
      o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
        the extent may not really point at the physical block we want, so we
        have to check it before write.
      Signed-off-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      32159242
  9. 02 10月, 2014 1 次提交
  10. 18 9月, 2014 7 次提交
  11. 24 8月, 2014 1 次提交
    • L
      Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo 提交于
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
      
      Btrfs has a kind of work queued as an ordered way, which means that its
      ordered_func() must be processed in the way of FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case, first when we find free space, we get an uncached block
      group, then we go to read its free space cache inode for free space information,
      so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happens
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readhead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      As if work A has a high priority in wq->ordered_list and there are more ordered
      works queued after it, such as B->ordered_func(), its memory could have been
      freed before normal_work_helper() returns, which means that kernel workqueue
      code worker_thread() still has worker->current_work pointer to be work
      A->normal_work's, ie. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, work C->normal_work
      and work A->normal_work are likely to share the same address(I confirmed this
      with ftrace output, so I'm not just guessing, it's rare though).
      
      When another kthread picks up work C->normal_work to process, and finds our
      kthread is processing it(see find_worker_executing_work()), it'll think
      work C as a collision and skip then, which ends up nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there're other cases that can lead to deadlock, but the real problem
      is that all btrfs workqueue shares one work->func, -- normal_work_helper,
      so this makes each workqueue to have its own helper function, but only a
      wraper pf normal_work_helper.
      
      With this patch, I no long hit the above hang.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9e0af237
  12. 19 8月, 2014 1 次提交
  13. 20 6月, 2014 1 次提交
  14. 10 6月, 2014 2 次提交
  15. 11 4月, 2014 1 次提交
  16. 08 4月, 2014 1 次提交
    • W
      Btrfs: scrub raid56 stripes in the right way · 3b080b25
      Wang Shilong 提交于
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
       # mount /dev/sda8 /mnt
       # btrfs scrub start -BR /mnt
       # echo $? <--unverified errors make return value be 3
      
      This is because we don't setup right mapping between physical
      and logical address for raid56, which makes checksum mismatch.
      But we will find everthing is fine later when rechecking using
      btrfs_map_block().
      
      This patch fixed the problem by settuping right mappings and
      we only verify data stripes' checksums.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3b080b25
  17. 11 3月, 2014 1 次提交