1. 01 9月, 2015 1 次提交
  2. 10 6月, 2015 1 次提交
    • Z
      btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() · 20b2e302
      Zhao Lei 提交于
      lockdep report following warning in test:
       [25176.843958] =================================
       [25176.844519] [ INFO: inconsistent lock state ]
       [25176.845047] 4.1.0-rc3 #22 Tainted: G        W
       [25176.845591] ---------------------------------
       [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
       [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
       [25176.847246]  (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
       [25176.847838] {SOFTIRQ-ON-W} state was registered at:
       [25176.848396]   [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
       [25176.848955]   [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
       [25176.849491]   [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
       [25176.850029]   [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
       [25176.850575]   [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
       [25176.851110]   [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
       [25176.851660]   [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
       [25176.852189]   [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
       [25176.852771]   [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
       [25176.853315]   [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
       [25176.853868]   [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
       [25176.854406]   [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
       [25176.854935] irq event stamp: 51506
       [25176.855511] hardirqs last  enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
       [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
       [25176.856642] softirqs last  enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
       [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
       [25176.857746]
       other info that might help us debug this:
       [25176.858845]  Possible unsafe locking scenario:
       [25176.859981]        CPU0
       [25176.860537]        ----
       [25176.861059]   lock(&wr_ctx->wr_lock);
       [25176.861705]   <Interrupt>
       [25176.862272]     lock(&wr_ctx->wr_lock);
       [25176.862881]
        *** DEADLOCK ***
      
      Reason:
       Above warning is caused by:
       Interrupt
       -> bio_endio()
       -> ...
       -> scrub_put_ctx()
       -> scrub_free_ctx() *1
       -> ...
       -> mutex_lock(&wr_ctx->wr_lock);
      
       scrub_put_ctx() is allowed to be called in end_bio interrupt, but
       in code design, it will never call scrub_free_ctx(sctx) in interrupe
       context(above *1), because btrfs_scrub_dev() get one additional
       reference of sctx->refs, which makes scrub_free_ctx() only called
       withine btrfs_scrub_dev().
      
       Now the code runs out of our wish, because free sequence in
       scrub_pending_bio_dec() have a gap.
      
       Current code:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
       -> scrub_free_ctx()                |
       -----------------------------------+-----------------------------------
      
       We expected:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
                                          | -> scrub_free_ctx()
       -----------------------------------+-----------------------------------
      
      Fix:
       Move scrub_pending_bio_dec() to a workqueue, to avoid this function run
       in interrupt context.
       Tested by check tracelog in debug.
      
      Changelog v1->v2:
       Use workqueue instead of adjust function call sequence in v1,
       because v1 will introduce a bug pointed out by:
       Filipe David Manana <fdmanana@gmail.com>
      Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      20b2e302
  3. 17 2月, 2015 1 次提交
  4. 18 9月, 2014 1 次提交
    • M
      Btrfs: implement repair function when direct read fails · 8b110e39
      Miao Xie 提交于
      This patch implement data repair function when direct read fails.
      
      The detail of the implementation is:
      - When we find the data is not right, we try to read the data from the other
        mirror.
      - When the io on the mirror ends, we will insert the endio work into the
        dedicated btrfs workqueue, not common read endio workqueue, because the
        original endio work is still blocked in the btrfs endio workqueue, if we
        insert the endio work of the io on the mirror into that workqueue, deadlock
        would happen.
      - After we get right data, we write it back to the corrupted mirror.
      - And if the data on the new mirror is still corrupted, we will try next
        mirror until we read right data or all the mirrors are traversed.
      - After the above work, we set the uptodate flag according to the result.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8b110e39
  5. 24 8月, 2014 1 次提交
    • L
      Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo 提交于
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
      
      Btrfs has a kind of work queued as an ordered way, which means that its
      ordered_func() must be processed in the way of FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case, first when we find free space, we get an uncached block
      group, then we go to read its free space cache inode for free space information,
      so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happens
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readhead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      As if work A has a high priority in wq->ordered_list and there are more ordered
      works queued after it, such as B->ordered_func(), its memory could have been
      freed before normal_work_helper() returns, which means that kernel workqueue
      code worker_thread() still has worker->current_work pointer to be work
      A->normal_work's, ie. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, work C->normal_work
      and work A->normal_work are likely to share the same address(I confirmed this
      with ftrace output, so I'm not just guessing, it's rare though).
      
      When another kthread picks up work C->normal_work to process, and finds our
      kthread is processing it(see find_worker_executing_work()), it'll think
      work C as a collision and skip then, which ends up nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there're other cases that can lead to deadlock, but the real problem
      is that all btrfs workqueue shares one work->func, -- normal_work_helper,
      so this makes each workqueue to have its own helper function, but only a
      wraper pf normal_work_helper.
      
      With this patch, I no long hit the above hang.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9e0af237
  6. 21 3月, 2014 1 次提交
  7. 11 3月, 2014 6 次提交
  8. 05 10月, 2013 1 次提交
    • I
      Btrfs: eliminate races in worker stopping code · 964fb15a
      Ilya Dryomov 提交于
      The current implementation of worker threads in Btrfs has races in
      worker stopping code, which cause all kinds of panics and lockups when
      running btrfs/011 xfstest in a loop.  The problem is that
      btrfs_stop_workers is unsynchronized with respect to check_idle_worker,
      check_busy_worker and __btrfs_start_workers.
      
      E.g., check_idle_worker race flow:
      
             btrfs_stop_workers():            check_idle_worker(aworker):
      - grabs the lock
      - splices the idle list into the
        working list
      - removes the first worker from the
        working list
      - releases the lock to wait for
        its kthread's completion
                                        - grabs the lock
                                        - if aworker is on the working list,
                                          moves aworker from the working list
                                          to the idle list
                                        - releases the lock
      - grabs the lock
      - puts the worker
      - removes the second worker from the
        working list
                                    ......
              btrfs_stop_workers returns, aworker is on the idle list
                       FS is umounted, memory is freed
                                    ......
                    aworker is waken up, fireworks ensue
      
      With this applied, I wasn't able to trigger the problem in 48 hours,
      whereas previously I could reliably reproduce at least one of these
      races within an hour.
      Reported-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      964fb15a
  9. 22 3月, 2012 1 次提交
  10. 16 12月, 2011 1 次提交
    • J
      Btrfs: fix num_workers_starting bug and other bugs in async thread · 0dc3b84a
      Josef Bacik 提交于
      Al pointed out we have some random problems with the way we account for
      num_workers_starting in the async thread stuff.  First of all we need to make
      sure to decrement num_workers_starting if we fail to start the worker, so make
      __btrfs_start_workers do this.  Also fix __btrfs_start_workers so that it
      doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we
      failed to create a worker.  Also check_pending_worker_creates needs to call
      __btrfs_start_work in it's work function since it already increments
      num_workers_starting.
      
      People only start one worker at a time, so get rid of the num_workers argument
      everywhere, and make btrfs_queue_worker a void since it will always succeed.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0dc3b84a
  11. 05 10月, 2009 1 次提交
    • C
      Btrfs: fix deadlock on async thread startup · 61d92c32
      Chris Mason 提交于
      The btrfs async worker threads are used for a wide variety of things,
      including processing bio end_io functions.  This means that when
      the endio threads aren't running, the rest of the FS isn't
      able to do the final processing required to clear PageWriteback.
      
      The endio threads also try to exit as they become idle and
      start more as the work piles up.  The problem is that starting more
      threads means kthreadd may need to allocate ram, and that allocation
      may wait until the global number of writeback pages on the system is
      below a certain limit.
      
      The result of that throttling is that end IO threads wait on
      kthreadd, who is waiting on IO to end, which will never happen.
      
      This commit fixes the deadlock by handing off thread startup to a
      dedicated thread.  It also fixes a bug where the on-demand thread
      creation was creating far too many threads because it didn't take into
      account threads being started by other procs.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      61d92c32
  12. 12 9月, 2009 2 次提交
    • C
      Btrfs: keep irqs on more often in the worker threads · 4e3f9c50
      Chris Mason 提交于
      The btrfs worker thread spinlock was being used both for the
      queueing of IO and for the processing of ordered events.
      
      The ordered events never happen from end_io handlers, and so they
      don't need to use the _irq version of spinlocks.  This adds a
      dedicated lock to the ordered lists so they don't have to run
      with irqs off.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4e3f9c50
    • C
      Btrfs: Allow worker threads to exit when idle · 9042846b
      Chris Mason 提交于
      The Btrfs worker threads don't currently die off after they have
      been idle for a while, leading to a lot of threads sitting around
      doing nothing for each mount.
      
      Also, they are unable to start atomically (from end_io hanlders).
      
      This commit reworks the worker threads so they can be started
      from end_io handlers (just setting a flag that asks for a thread
      to be added at a later date) and so they can exit if they
      have been idle for a long time.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9042846b
  13. 21 4月, 2009 1 次提交
    • C
      Btrfs: add a priority queue to the async thread helpers · d313d7a3
      Chris Mason 提交于
      Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
      higher priority.  But, the checksumming helper threads prevent it
      from being fully effective.
      
      There are two problems.  First, a big queue of pending checksumming
      will delay the synchronous IO behind other lower priority writes.  Second,
      the checksumming uses an ordered async work queue.  The ordering makes sure
      that IOs are sent to the block layer in the same order they are sent
      to the checksumming threads.  Usually this gives us less seeky IO.
      
      But, when we start mixing IO priorities, the lower priority IO can delay
      the higher priority IO.
      
      This patch solves both problems by adding a high priority list to the async
      helper threads, and a new btrfs_set_work_high_prio(), which is used
      to make put a new async work item onto the higher priority list.
      
      The ordering is still done on high priority IO, but all of the high
      priority bios are ordered separately from the low priority bios.  This
      ordering is purely an IO optimization, it is not involved in data
      or metadata integrity.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d313d7a3
  14. 07 11月, 2008 1 次提交
    • C
      Btrfs: Add ordered async work queues · 4a69a410
      Chris Mason 提交于
      Btrfs uses kernel threads to create async work queues for cpu intensive
      operations such as checksumming and decompression.  These work well,
      but they make it difficult to keep IO order intact.
      
      A single writepages call from pdflush or fsync will turn into a number
      of bios, and each bio is checksummed in parallel.  Once the checksum is
      computed, the bio is sent down to the disk, and since we don't control
      the order in which the parallel operations happen, they might go down to
      the disk in almost any order.
      
      The code deals with this somewhat by having deep work queues for a single
      kernel thread, making it very likely that a single thread will process all
      the bios for a single inode.
      
      This patch introduces an explicitly ordered work queue.  As work structs
      are placed into the queue they are put onto the tail of a list.  They have
      three callbacks:
      
      ->func (cpu intensive processing here)
      ->ordered_func (order sensitive processing here)
      ->ordered_free (free the work struct, all processing is done)
      
      The work struct has three callbacks.  The func callback does the cpu intensive
      work, and when it completes the work struct is marked as done.
      
      Every time a work struct completes, the list is checked to see if the head
      is marked as done.  If so the ordered_func callback is used to do the
      order sensitive processing and the ordered_free callback is used to do
      any cleanup.  Then we loop back and check the head of the list again.
      
      This patch also changes the checksumming code to use the ordered workqueues.
      One a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4a69a410
  15. 30 9月, 2008 1 次提交
    • C
      Btrfs: add and improve comments · d352ac68
      Chris Mason 提交于
      This improves the comments at the top of many functions.  It didn't
      dive into the guts of functions because I was trying to
      avoid merging problems with the new allocator and back reference work.
      
      extent-tree.c and volumes.c were both skipped, and there is definitely
      more work todo in cleaning and commenting the code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d352ac68
  16. 25 9月, 2008 3 次提交
    • C
      5443be45
    • C
      Btrfs: Worker thread optimizations · 35d8ba66
      Chris Mason 提交于
      This changes the worker thread pool to maintain a list of idle threads,
      avoiding a complex search for a good thread to wake up.
      
      Threads have two states:
      
      idle - we try to reuse the last thread used in hopes of improving the batching
      ratios
      
      busy - each time a new work item is added to a busy task, the task is
      rotated to the end of the line.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      35d8ba66
    • C
      Btrfs: Add async worker threads for pre and post IO checksumming · 8b712842
      Chris Mason 提交于
      Btrfs has been using workqueues to spread the checksumming load across
      other CPUs in the system.  But, workqueues only schedule work on the
      same CPU that queued the work, giving them a limited benefit for systems with
      higher CPU counts.
      
      This code adds a generic facility to schedule work with pools of kthreads,
      and changes the bio submission code to queue bios up.  The queueing is
      important to make sure large numbers of procs on the system don't
      turn streaming workloads into random workloads by sending IO down
      concurrently.
      
      The end result of all of this is much higher performance (and CPU usage) when
      doing checksumming on large machines.  Two worker pools are created,
      one for writes and one for endio processing.  The two could deadlock if
      we tried to service both from a single pool.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      8b712842