1. 30 May 2016 (1 commit)
    • Btrfs: fix race between readahead and device replace/removal · ce7791ff
      Committed by Filipe Manana
      The list of devices is protected by the device_list_mutex, and the device
      replace code, in its finishing phase, correctly takes that mutex before
      removing the source device from the list. However, the readahead code was
      iterating that list without acquiring the mutex, leading to crashes later
      on due to invalid memory accesses:
      
      [125671.831036] general protection fault: 0000 [#1] PREEMPT SMP
      [125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport
      processor ser
      [125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G        W       4.6.0-rc7-btrfs-next-29+ #1
      [125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
      [125671.834973] task: ffff8801ac520540 ti: ffff8801ac918000 task.ti: ffff8801ac918000
      [125671.834973] RIP: 0010:[<ffffffff81270479>]  [<ffffffff81270479>] __radix_tree_lookup+0x6a/0x105
      [125671.834973] RSP: 0018:ffff8801ac91bc28  EFLAGS: 00010206
      [125671.834973] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6a RCX: 0000000000000000
      [125671.834973] RDX: 0000000000000000 RSI: 00000000000c1bff RDI: ffff88002ebd62a8
      [125671.834973] RBP: ffff8801ac91bc70 R08: 0000000000000001 R09: 0000000000000000
      [125671.834973] R10: ffff8801ac91bc70 R11: 0000000000000000 R12: ffff88002ebd62a8
      [125671.834973] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000c1bff
      [125671.834973] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
      [125671.834973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [125671.834973] CR2: 000000000073cae4 CR3: 00000000b7723000 CR4: 00000000000006e0
      [125671.834973] Stack:
      [125671.834973]  0000000000000000 ffff8801422d5600 ffff8802286bbc00 0000000000000000
      [125671.834973]  0000000000000001 ffff8802286bbc00 00000000000c1bff 0000000000000000
      [125671.834973]  ffff88002e639eb8 ffff8801ac91bc80 ffffffff81270541 ffff8801ac91bcb0
      [125671.834973] Call Trace:
      [125671.834973]  [<ffffffff81270541>] radix_tree_lookup+0xd/0xf
      [125671.834973]  [<ffffffffa04ae6a6>] reada_peer_zones_set_lock+0x3e/0x60 [btrfs]
      [125671.834973]  [<ffffffffa04ae8b9>] reada_pick_zone+0x29/0x103 [btrfs]
      [125671.834973]  [<ffffffffa04af42f>] reada_start_machine_worker+0x129/0x2d3 [btrfs]
      [125671.834973]  [<ffffffffa04880be>] btrfs_scrubparity_helper+0x185/0x3aa [btrfs]
      [125671.834973]  [<ffffffffa0488341>] btrfs_readahead_helper+0xe/0x10 [btrfs]
      [125671.834973]  [<ffffffff81069691>] process_one_work+0x271/0x4e9
      [125671.834973]  [<ffffffff81069dda>] worker_thread+0x1eb/0x2c9
      [125671.834973]  [<ffffffff81069bef>] ? rescuer_thread+0x2b3/0x2b3
      [125671.834973]  [<ffffffff8106f403>] kthread+0xd4/0xdc
      [125671.834973]  [<ffffffff8149e242>] ret_from_fork+0x22/0x40
      [125671.834973]  [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
      
      So fix this by taking the device_list_mutex in the readahead code. We
      can't use the lighter approach of an rcu_read_lock()/rcu_read_unlock()
      pair together with list_for_each_entry_rcu() here, because the code
      path ends up calling sleeping functions (kzalloc()).
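      A minimal sketch of the shape of the fix, assuming the usual btrfs
      device-list layout of that era (the helper name below is hypothetical,
      not the exact diff):

        static void reada_walk_devices(struct btrfs_fs_info *fs_info)
        {
                struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
                struct btrfs_device *device;

                /* device replace cannot remove the source device from the
                 * list while we hold this mutex, and unlike an RCU read
                 * section we may sleep here, e.g. in kzalloc(GFP_KERNEL) */
                mutex_lock(&fs_devices->device_list_mutex);
                list_for_each_entry(device, &fs_devices->devices, dev_list) {
                        /* ... per-device readahead zone setup ... */
                }
                mutex_unlock(&fs_devices->device_list_mutex);
        }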
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
  2. 05 Apr 2016 (1 commit)
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to implement
      the page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it's a constant source of confusion whether the
      PAGE_CACHE_* or the PAGE_* constant should be used in a particular
      case, especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in the page cache are special. They
      are not.
      
      The changes are pretty straightforward (an illustrative before/after
      example follows the list):
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
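      For example, a hypothetical filesystem helper (not taken from the
      patch) would be converted like this:

        /* before */
        pgoff_t index = pos >> PAGE_CACHE_SHIFT;
        size_t bytes = min_t(size_t, count, PAGE_CACHE_SIZE - offset);
        page_cache_get(page);
        /* ... copy data ... */
        page_cache_release(page);

        /* after: same behavior, PAGE_* names */
        pgoff_t index = pos >> PAGE_SHIFT;
        size_t bytes = min_t(size_t, count, PAGE_SIZE - offset);
        get_page(page);
        /* ... copy data ... */
        put_page(page);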
      
      This patch contains automated changes generated with coccinelle using
      the script below. For some reason, coccinelle doesn't patch header
      files; I've called spatch on them manually.
      
      The only adjustment after coccinelle is reverting the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach. I'll
      fix them manually in a separate patch. Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 23 Feb 2016 (1 commit)
    • Btrfs: fix lockdep deadlock warning due to dev_replace · 73beece9
      Committed by Liu Bo
      Xfstests btrfs/011 complains with a lockdep deadlock warning:
      
      [ 1226.649039] =========================================================
      [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
      [ 1226.649039] 4.1.0+ #270 Not tainted
      [ 1226.649039] ---------------------------------------------------------
      [ 1226.652955] kswapd0/46 just changed the state of lock:
      [ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1226.652955]
      other info that might help us debug this:
      [ 1226.652955] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1226.652955]  Possible interrupt unsafe locking scenario:
      
      [ 1226.652955]        CPU0                    CPU1
      [ 1226.652955]        ----                    ----
      [ 1226.652955]   lock(&fs_info->dev_replace.lock);
      [ 1226.652955]                                local_irq_disable();
      [ 1226.652955]                                lock(&delayed_node->mutex);
      [ 1226.652955]                                lock(&found->groups_sem);
      [ 1226.652955]   <Interrupt>
      [ 1226.652955]     lock(&delayed_node->mutex);
      [ 1226.652955]
       *** DEADLOCK ***
      
      Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.")
      tried to fix a similar problem with exactly the same warning, but with
      that fix applied we still run into this one.
      
      The above lock chain comes from
      btrfs_commit_transaction
        ->btrfs_run_delayed_items
          ...
          ->__btrfs_update_delayed_inode
            ...
            ->__btrfs_cow_block
               ...
               ->find_free_extent
                  ->cache_block_group
                    ->load_free_space_cache
                      ->btrfs_readpages
                        ->submit_one_bio
                          ...
                          ->__btrfs_map_block
                            ->btrfs_dev_replace_lock
      
      However, under high memory pressure, tasks holding dev_replace.lock can
      be interrupted by kswapd, which then tries to release memory occupied
      by the superblock, inodes and dentries. In doing so it may call
      evict_inode(), and it comes to
      
      [ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
      [ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
      
      delayed_node->mutex may be acquired in __btrfs_release_delayed_node(),
      which leads to an ABBA deadlock.
      
      To fix this, we can use the "blocking rwlock" approach already used for
      extent_buffer, though things are simpler here since we only need to let
      the read side go from a spinlock to a blocking lock.
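      A rough sketch of the idea, with the caveat that the structure and
      function names below are illustrative assumptions rather than the
      exact patch:

        struct reada_blocking_lock {
                rwlock_t lock;                  /* short, spinning sections */
                atomic_t blocking_readers;      /* readers that may sleep */
                wait_queue_head_t read_lock_wq; /* writers wait here */
        };

        /* a reader converts its spinning hold into a blocking reference,
         * so reclaim re-entering btrfs can no longer deadlock on it */
        static void reada_read_lock_blocking(struct reada_blocking_lock *l)
        {
                atomic_inc(&l->blocking_readers);
                read_unlock(&l->lock);
        }

        static void reada_read_unlock_blocking(struct reada_blocking_lock *l)
        {
                /* the last blocking reader wakes any waiting writer */
                if (atomic_dec_and_test(&l->blocking_readers))
                        wake_up(&l->read_lock_wq);
        }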
      
      With this, btrfs/011 no longer produces warnings in dmesg.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 18 Feb 2016 (13 commits)
  5. 16 Feb 2016 (4 commits)
    • btrfs: reada: Avoid many times of empty loop · 97d5f0e6
      Committed by Zhao Lei
      We can see the following loop (10000 times) in the trace log:
       [   75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
       [   75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
       [   75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
       [   75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
      
       [   75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
       [   75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
       [   75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
       [   75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
      
       ...(10000 times)
      
       [  124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
       [  124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
       [  124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
       [  124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
      
      Reason:
       If more than one user triggers readahead on the same extent, the first
       task finishes setting up the reada data structures and calls
       reada_start_machine() to start, while the second task has only taken a
       ref_count and has not yet completely added its reada_extctl struct. The
       reada_extent therefore cannot finish all its jobs and keeps being
       selected in __reada_start_machine(), 10000 times (the total number of
       iterations in __reada_start_machine()).
      
      Fix:
       For a reada_extent without jobs we don't need to run it; just return 0
       to let the caller break out. A sketch of the check follows.
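      Illustrative sketch of the check in reada_start_machine_dev() (context
      trimmed, surrounding names assumed from the log above, not the exact
      diff):

        /* bail out when the chosen reada_extent has no reada_extctl jobs
         * attached yet, instead of spinning on it until the loop limit */
        if (list_empty(&re->extctl)) {
                spin_unlock(&fs_info->reada_lock);
                reada_extent_put(fs_info, re);  /* drop the ref taken above */
                return 0;                       /* caller breaks its loop */
        }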
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reada: Add missed segment checking in reada_find_zone · 8e9aa51f
      Committed by Zhao Lei
      When rechecking that the zone is in the tree, we still need to check
      that the zone includes our logical address, as sketched below.
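      Illustrative sketch (the radix-tree key and field names are assumptions
      based on the surrounding commits, not the exact diff):

        /* a concurrent insert won the race: look the zone up again, but
         * only reuse it if it really covers our logical address */
        zone = radix_tree_lookup(&dev->reada_zones, index);
        if (zone && logical >= zone->start && logical <= zone->end)
                kref_get(&zone->refcnt);
        else
                zone = NULL;    /* raced with a different segment */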
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reada: reduce additional fs_info->reada_lock in reada_find_zone · c37f49c7
      Committed by Zhao Lei
      We can avoid an additional lock acquisition and one kref_get()/kref_put()
      pair by combining the two conditions, roughly as follows.
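      Illustrative sketch of the combined lookup (names assumed, not the
      exact diff):

        spin_lock(&fs_info->reada_lock);
        ret = radix_tree_gang_lookup(&dev->reada_zones, (void **)&zone,
                                     logical >> PAGE_SHIFT, 1);
        /* one test decides both "found a zone" and "zone covers logical",
         * so the reference is taken at most once, under a single lock
         * section */
        if (ret == 1 && logical >= zone->start && logical <= zone->end) {
                kref_get(&zone->refcnt);
                spin_unlock(&fs_info->reada_lock);
                return zone;
        }
        spin_unlock(&fs_info->reada_lock);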
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reada: Fix in-segment calculation for reada · 50378530
      Committed by Zhao Lei
      reada_zone->end is the end position of the segment:
       end = start + cache->key.offset - 1;

      So we need to use "<=" in the condition to judge whether a position is
      inside the segment.

      The problem happened rarely, because a logical position rarely pointed
      into the last 4k of a block group, but we need to fix it to make the
      logic correct. A worked example follows.
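      Worked example (illustrative): for an 8 KiB segment starting at 0,
      end = 0 + 8192 - 1 = 8191, so end is inclusive. A logical position of
      8191 lies in the segment; "logical < zone->end" would wrongly reject
      it, while

        if (logical >= zone->start && logical <= zone->end)
                /* position is inside the segment */

      accepts it.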
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 11 Feb 2016 (1 commit)
    • btrfs: reada: use GFP_KERNEL everywhere · ed0244fa
      Committed by David Sterba
      The readahead framework is not on the critical writeback path, so we
      don't need to use GFP_NOFS for allocations. All error paths are handled
      and readahead failures are not fatal. The actual users (scrub,
      dev-replace) will trigger reads if the blocks are not found in cache.
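      The shape of the change, as a sketch (the allocation site shown is an
      assumption, not a verbatim hunk):

       - re = kzalloc(sizeof(*re), GFP_NOFS);
       + re = kzalloc(sizeof(*re), GFP_KERNEL);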
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 22 Oct 2015 (1 commit)
  8. 09 Aug 2015 (1 commit)
    • Btrfs: count devices correctly in readahead during RAID 5/6 replace · 7cb2c420
      Committed by Omar Sandoval
      Commit 5fbc7c59 ("Btrfs: fix unfinished readahead thread for raid5/6
      degraded mounting") fixed a problem where we would skip a missing device
      when we shouldn't have because there are no other mirrors to read from
      in RAID 5/6. After commit 2c8cdd6e ("Btrfs, replace: write dirty
      pages into the replace target device"), the fix doesn't work when we're
      doing a missing device replace on RAID 5/6 because the replace device is
      counted as a mirror so we're tricked into thinking we can safely skip
      the missing device. The fix is to count only the real stripes and
      decide based on that, roughly as sketched below.
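      Illustrative sketch (the num_tgtdevs field name is an assumption about
      the btrfs_bio of that era, not a verbatim hunk):

        /* replace targets are appended to the stripe list, so exclude them
         * when deciding whether a missing device can be skipped */
        real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
        if (!dev->bdev && real_stripes > 1)
                continue;       /* another real mirror can serve the read */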
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  9. 22 Jan 2015 (1 commit)
  10. 13 Dec 2014 (2 commits)
  11. 18 Sep 2014 (1 commit)
  12. 24 Aug 2014 (1 commit)
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Committed by Liu Bo
      This has been reported and discussed for a long time; the hang occurs
      in both 3.15 and 3.16.
      
      Btrfs has migrated to the kernel workqueue, but the migration
      introduced this hang.
      
      Btrfs has a kind of work that is queued in an ordered way, meaning that
      its ordered_func() must be processed FIFO, so it usually looks like:
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case. First, when we find free space, we get an
      uncached block group and go to read its free space cache inode for
      free space information, so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct
      happen to share the same address.
      
      A bit more explanation:
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readahead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      If work A has a high priority in wq->ordered_list and there are more
      ordered works queued after it, such as B->ordered_func(), its memory
      could have been freed before normal_work_helper() returns, which means
      that the kernel workqueue code in worker_thread() still has
      worker->current_work pointing to work A->normal_work, i.e. arg's
      address.
      
      Meanwhile, work C is allocated after work A is freed, and
      C->normal_work and A->normal_work are likely to share the same address
      (I confirmed this with ftrace output, so I'm not just guessing; it's
      rare though).
      
      When another kthread picks up work C->normal_work to process and finds
      that our kthread is processing it (see find_worker_executing_work()),
      it treats work C as a collision and skips it, which ends up with nobody
      processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real
      problem is that all btrfs workqueues share one work->func,
      normal_work_helper. So this patch makes each workqueue have its own
      helper function, which is only a wrapper of normal_work_helper; a
      sketch follows.
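      A sketch of the per-queue wrappers (the real patch generates them with
      a macro; the helper names shown are illustrative):

        #define BTRFS_WORK_HELPER(name)                          \
        void btrfs_##name(struct work_struct *arg)               \
        {                                                        \
                normal_work_helper(arg);                         \
        }

        BTRFS_WORK_HELPER(readahead_helper);
        BTRFS_WORK_HELPER(endio_helper);

      Each queue now has a distinct work->func, so find_worker_executing_work()
      can no longer mistake a recycled work_struct address from one queue for
      a collision in another. The btrfs_readahead_helper frame visible in the
      crash trace of the first commit above is one of these wrappers.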
      
      With this patch, I no longer hit the above hang.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  13. 14 Jun 2014 (1 commit)
    • Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting · 5fbc7c59
      Committed by Wang Shilong
      Steps to reproduce:
      
       # mkfs.btrfs -f /dev/sd[b-f] -m raid5 -d raid5
       # mkfs.ext4 /dev/sdc   ---> corrupt one of the btrfs devices
       # mount /dev/sdb /mnt -o degraded
       # btrfs scrub start -BRd /mnt
      
      This is because readahead would skip a missing device, but that is not
      safe for RAID5/6, because REQ_GET_READ_MIRRORS returns 1 for RAID5/6
      block mapping. If the expected data is located on the missing device,
      the readahead thread never calls __readahead_hook(), so @rc->elems
      never drops to 0 and the wait on that event never finishes.
      
      Fix this problem by checking the return value of btrfs_map_block(): we
      can only skip a missing device safely if there are several mirrors, as
      sketched below.
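      Illustrative sketch (not the exact diff):

        /* cannot read ahead on a missing device; it is only safe to skip
         * it when the mapping reported more than one stripe to read from,
         * which is never the case for RAID5/6 */
        if (!dev->bdev && bbio->num_stripes > 1)
                continue;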
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  14. 11 Mar 2014 (2 commits)
  15. 29 Jan 2014 (1 commit)
  16. 07 May 2013 (1 commit)
  17. 13 Dec 2012 (4 commits)
    • Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block() · 29a8d9a0
      Committed by Stefan Behrens
      Before this commit, btrfs_map_block() was called with REQ_WRITE in
      order to retrieve the list of mirrors for a disk block. This needs to
      be changed for the device replace procedure, since it makes a
      difference whether you are asking for read mirrors or for locations to
      write to. GET_READ_MIRRORS is introduced as a new way to call
      btrfs_map_block(). In the current commit the functionality is not yet
      changed; only the GET_READ_MIRRORS interface is introduced, and all
      the places that should use it are adapted.
      
      The reason that REQ_WRITE cannot be abused anymore to retrieve
      a list of read mirrors is that during a running dev replace
      operation all write requests to the live filesystem are
      duplicated to also write to the target drive.
      Keep in mind that the target disk is only partially a valid
      copy of the source disk while the operation is ongoing. All
      writes go to the target disk, but not all reads would return
      valid data on the target disk. Therefore it is not possible
      anymore to abuse a REQ_WRITE interface to find valid mirrors
      for a REQ_READ.
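      A sketch of what a converted call site looks like (the flag value is
      an assumption for illustration, not copied from the patch):

        #define REQ_GET_READ_MIRRORS	(1 << 30)

        /* was: btrfs_map_block(fs_info, REQ_WRITE, ...) */
        ret = btrfs_map_block(fs_info, REQ_GET_READ_MIRRORS, logical,
                              &length, &bbio, 0);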
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: change core code of btrfs to support the device replace operations · 8dabb742
      Committed by Stefan Behrens
      This commit contains all the essential changes to the core code
      of Btrfs for support of the device replace procedure.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: add code to scrub to copy read data to another disk · ff023aac
      Committed by Stefan Behrens
      The device replace procedure makes use of the scrub code. The scrub
      code is the most efficient code to read the allocated data of a disk,
      i.e. it reads sequentially in order to avoid disk head movements, it
      skips unallocated blocks, it uses read ahead mechanisms, and it
      contains all the code to detect and repair defects.
      This commit adds code to scrub to allow the scrub code to copy read
      data to another disk.
      One goal is to perform as fast as possible, therefore the write
      requests are collected until huge bios are built, and the write process
      is decoupled from the read process with some kind of flow control, of
      course, in order to limit the allocated memory. The best performance on
      spinning disks can be reached when head movements are avoided as much
      as possible, therefore a single worker is used to interface the read
      process with the write process. The regular scrub operation works as
      fast as before; it is not negatively influenced and is actually more or
      less unchanged. A rough sketch of the decoupling follows.
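      Illustrative data-structure sketch (all names here are assumptions, a
      shape rather than the actual implementation):

        struct scrub_wr_ctx {
                struct list_head  pending_pages; /* copied data awaiting write */
                spinlock_t        list_lock;
                wait_queue_head_t flow_wq;       /* readers throttle here */
                int               pages_in_flight;
                int               max_in_flight; /* bounds allocated memory */
        };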
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree · 3ec706c8
      Committed by Stefan Behrens
      This is required for the device replace procedure in a later step. Two
      calling functions also had to be changed to take the fs_info pointer:
      repair_io_failure() and scrub_setup_recheck_block(). The signature
      change is sketched below.
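      Sketch of the signature change (illustrative, reconstructed from the
      description rather than copied from the diff):

       -int btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
       +int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
                            u64 logical, u64 *length,
                            struct btrfs_bio **bbio_ret, int mirror_num);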
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
  18. 03 Oct 2012 (1 commit)
  19. 30 May 2012 (1 commit)
  20. 19 Apr 2012 (1 commit)