1. 08 Aug, 2010 6 commits
    • writeback: do not lose wake-ups in the forker thread - 1 · c5f7ad23
      Artem Bityutskiy committed
      Currently the forker thread can lose wake-ups which may lead to unnecessary
      delays in processing bdi works. E.g., consider the following scenario.
      
      1. 'bdi_forker_thread()' walks the 'bdi_list', finds out there is nothing to
         do, and is about to finish the loop.
      2. A bdi thread decides to exit because it was inactive for a long time.
      3. 'bdi_queue_work()' adds a work item to the bdi which just exited, so it wakes up
         the forker thread.
      4. But 'bdi_forker_thread()' then executes 'set_current_state(TASK_INTERRUPTIBLE)'
         and goes to sleep. We lose a wake-up.
      
      Losing the wake-up is not fatal, but this means that the bdi work processing
      will be delayed by up to 5 sec. This race is theoretical, I never hit it, but
      it is worth fixing.
      
      The fix is to execute 'set_current_state(TASK_INTERRUPTIBLE)' _before_ walking
      'bdi_list', not after.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
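The ordering problem described in the commit above can be illustrated with a small, deterministic userspace model. This is only a sketch under stated assumptions: `struct sim`, `sim_wake_up()` and `sim_schedule_sleeps()` are hypothetical stand-ins for the task state, `wake_up_process()` and `schedule()`, and the racing wake-up is injected by hand at the point where the scenario places it. It is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

/* Hypothetical model of the forker thread plus one pending-work flag. */
struct sim {
    enum task_state state;  /* state of the simulated forker thread */
    bool work_pending;      /* set by the simulated bdi_queue_work() */
};

/* Models a wake-up: queue work and mark the task runnable. */
static void sim_wake_up(struct sim *s)
{
    s->work_pending = true;
    s->state = TASK_RUNNING;
}

/* Models schedule(): the task only sleeps if it is still marked as sleeping. */
static bool sim_schedule_sleeps(const struct sim *s)
{
    return s->state == TASK_INTERRUPTIBLE;
}

/* Old order: walk the list, then set the state. Returns true if the task sleeps. */
static bool buggy_order_sleeps(void)
{
    struct sim s = { TASK_RUNNING, false };
    bool found = s.work_pending;      /* walk: nothing to do */
    sim_wake_up(&s);                  /* racing wake-up arrives here */
    if (found)
        return false;
    s.state = TASK_INTERRUPTIBLE;     /* overwrites the effect of the wake-up */
    return sim_schedule_sleeps(&s);   /* sleeps although work is pending */
}

/* Fixed order: set the state first, then walk the list. */
static bool fixed_order_sleeps(void)
{
    struct sim s = { TASK_RUNNING, false };
    s.state = TASK_INTERRUPTIBLE;     /* declare the intent to sleep first */
    bool found = s.work_pending;      /* walk: nothing to do yet */
    sim_wake_up(&s);                  /* racing wake-up resets the state */
    if (found) {
        s.state = TASK_RUNNING;
        return false;
    }
    return sim_schedule_sleeps(&s);   /* state is TASK_RUNNING: no sleep */
}
```

In this model `buggy_order_sleeps()` reports that the thread goes to sleep with work pending, while `fixed_order_sleeps()` reports that it does not, because the wake-up that arrives after the state change is no longer overwritten.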
    • writeback: fix possible race when creating bdi threads · 94eac5e6
      Artem Bityutskiy committed
      This patch fixes a very unlikely race condition on the bdi forker thread error
      path: when bdi thread creation fails, 'bdi->wb.task' may contain the error code
      for a short period of time. If at the same time someone submits a work to this
      bdi, we can end up with an oops in 'bdi_queue_work()' while executing
      'wake_up_process(wb->task)'.
      
      This patch fixes the issue by introducing a temporary variable 'task' and
      storing the possible error code there, so that 'wb->task' would never take
      erroneous values.
      
      Note, this race is very unlikely and I never hit it, so it is theoretical, but
      nevertheless worth fixing.
      
      This patch also merges 2 comments which were previously separate.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
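The temporary-variable idea from the commit above can be sketched in userspace C. Everything here is an illustrative assumption: `err_ptr()` and `is_err()` imitate the kernel's error-pointer convention, `create_thread()` is a hypothetical stand-in for thread creation, and `struct writeback` is a reduced model with only the `task` field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitation of the error-pointer convention (assumption). */
static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)(-MAX_ERRNO);
}

struct task { int id; };
struct writeback { struct task *task; };  /* reduced model of the real struct */

/* Hypothetical stand-in for thread creation; fails on request. */
static struct task *create_thread(int fail)
{
    static struct task t = { 1 };
    return fail ? err_ptr(-12) : &t;
}

/*
 * Store the result in a local variable first. The shared wb->task is only
 * written once the pointer is known to be valid, so a concurrent reader can
 * never observe an error code there.
 */
static int start_thread(struct writeback *wb, int fail)
{
    struct task *task = create_thread(fail);

    if (is_err(task))
        return -1;      /* wb->task is left untouched */
    wb->task = task;    /* publish only a valid pointer */
    return 0;
}
```

After a failed `start_thread()` call the `task` field still holds its previous value (for example `NULL`), which is the property the commit message relies on.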
    • writeback: harmonize writeback threads naming · 6f904ff0
      Artem Bityutskiy committed
      The write-back code mixes the words "thread" and "task" for the same things. This
      is not a big deal, but still an inconsistency.
      
      hch: a convention I tend to use and I've seen in various places
      is to always use _task for the storage of the task_struct pointer,
      and thread everywhere else.  This especially helps with having
      foo_thread for the actual thread and foo_task for a global
      variable keeping the task_struct pointer
      
      This patch renames:
      * 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
      * 'bdi_forker_task()'              -> 'bdi_forker_thread()'
      
      because bdi threads are 'bdi_writeback_thread()', so these names are more
      consistent.
      
      This patch also amends comments and makes them refer to the forker and bdi
      threads as "thread", not "task".
      
      Also, while at it, make the 'bdi_add_default_flusher_thread()' declaration use
      'static void' instead of 'void static' and make checkpatch.pl happy.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: Initial tracing support · 455b2864
      Dave Chinner committed
      Trace queue/sched/exec parts of the writeback loop. This provides
      insight into when and why flusher threads are scheduled to run. For
      example, a sync invocation leaves traces like:
      
           sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
      flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
      
      This also lays the foundation for adding more writeback tracing to
      provide deeper insight into the whole writeback path.
      
      The original tracing code is from Jens Axboe, though this version is
      a rewrite as a result of the code being traced changing
      significantly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: merge bdi_writeback_task and bdi_start_fn · 08243900
      Christoph Hellwig committed
      Move all code for the writeback thread into fs/fs-writeback.c instead of
      splitting it over two functions in two files.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: remove wb_list · c1955ce3
      Christoph Hellwig committed
      The wb_list member of struct backing_dev_info always has exactly one
      element.  Just use the direct bdi->wb pointer instead and simplify some
      code.
      
      Also remove bdi_task_init, which is now trivial, to prepare for the next
      patch.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  2. 06 Jul, 2010 2 commits
    • writeback: simplify the write back thread queue · 83ba7b07
      Christoph Hellwig committed
      First, remove items from work_list as soon as we start working on them.  This
      means we don't have to track any pending or visited state and can get
      rid of all the RCU magic freeing the work items - we can simply free
      them once the operation has finished.
      
      Second, use a real completion for tracking synchronous requests - if the
      caller sets the completion pointer we complete it, otherwise use it as a
      boolean indicator that we can free the work item directly.
      
      Third, unify struct wb_writeback_args and struct bdi_work into a single
      data structure, wb_writeback_work.  Previously we set all parameters in a
      struct wb_writeback_args, copied it into struct bdi_work, and copied it
      again onto the stack to use it there.  Instead, just allocate one structure
      dynamically or on the stack and use it all the way through the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
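The "completion pointer doubles as an ownership flag" rule from the commit above could look roughly like this in plain C. The types are simplified assumptions: `struct completion` is reduced to a flag, and `struct wb_work` is a hypothetical two-field model, not the kernel's `wb_writeback_work`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Reduced stand-in for a completion: just a flag (assumption). */
struct completion { bool done; };

/* Hypothetical, simplified work item. */
struct wb_work {
    long nr_pages;
    struct completion *done;  /* non-NULL: a synchronous caller is waiting */
};

/*
 * Called by the worker once the operation has finished.
 * - Synchronous request: signal the waiter, who owns the item
 *   (typically on its stack), so the worker must not free it.
 * - Asynchronous request: nobody waits, so the worker frees the
 *   heap-allocated item directly.
 */
static void finish_work(struct wb_work *work)
{
    if (work->done)
        work->done->done = true;
    else
        free(work);
}
```

A synchronous caller would place the work item and the completion on its own stack and wait for the flag; an asynchronous caller would `malloc()` the item, leave `done` as `NULL`, and never touch it again after queueing.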
    • writeback: remove writeback_inodes_wbc · 9c3a8ee8
      Christoph Hellwig committed
      This was just an odd wrapper around writeback_inodes_wb.  Removing this
      also allows us to get rid of the bdi member of struct writeback_control,
      which was rather out of place there.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  3. 22 May, 2010 1 commit
    • writeback: fixups for !dirty_writeback_centisecs · 6423104b
      Jens Axboe committed
      Commit 69b62d01 fixed up most of the places where we would enter
      busy schedule() spins when disabling the periodic background
      writeback. This fixes up the sb timer so that it doesn't get
      hammered on with the delay disabled, and ensures that it gets
      rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs
      gets modified.
      
      bdi_forker_task() also needs to check for !dirty_writeback_centisecs
      and use schedule() appropriately; fix that up too.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
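The busy-spin hazard mentioned in the commit above comes from treating a zero interval as a zero-length timeout. A minimal sketch of the decision, assuming a hypothetical helper `forker_sleep_ms()` and a hypothetical `SLEEP_FOREVER` marker (neither is a kernel name), might be:

```c
#include <assert.h>

/* Hypothetical marker meaning "sleep until explicitly woken". */
#define SLEEP_FOREVER (-1L)

/*
 * Pick how long the forker thread should sleep.
 * With periodic writeback disabled (interval == 0), a timed sleep of
 * zero would return immediately and turn the loop into a busy spin,
 * so the thread must block until it is woken instead.
 */
static long forker_sleep_ms(unsigned int dirty_writeback_centisecs)
{
    if (dirty_writeback_centisecs == 0)
        return SLEEP_FOREVER;                    /* plain schedule() case */
    return (long)dirty_writeback_centisecs * 10; /* centiseconds to ms */
}
```

With the common setting of 500 centiseconds this yields a 5000 ms timed sleep; with 0 it selects the untimed sleep.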
  4. 25 Apr, 2010 1 commit
    • Catch filesystems lacking s_bdi · 5129a469
      Jörn Engel committed
      noop_backing_dev_info is used only as a flag to mark filesystems that
      don't have any backing store, like tmpfs, procfs, spufs, etc.
      Signed-off-by: Joern Engel <joern@logfs.org>
      
      Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
      to the noop_backing_dev_info is not legal and will not result in
      them being flushed, but we already catch this condition in
      __mark_inode_dirty() when checking for a registered bdi.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  5. 22 Apr, 2010 1 commit
  6. 02 Apr, 2010 1 commit
  7. 03 Dec, 2009 1 commit
  8. 12 Nov, 2009 1 commit
    • Thaw refrigerated bdi flusher threads before invoking kthread_stop on them · c62b17a5
      Romit Dasgupta committed
      Unfreezes the bdi flusher task when the said task needs to exit.
      
      Steps to reproduce:
      1) Mount a file system from MMC/SD card.
      2) Unmount the file system. This creates a flusher task.
      3) Attempt suspend to RAM. System is unresponsive.
      
      This is because the bdi flusher thread is already in the refrigerator and will
      remain so until it is thawed. The MMC driver suspend routine call stack will
      ultimately issue a 'kthread_stop' on the bdi flusher thread and will block
      until the flusher thread has exited. Since the bdi flusher thread is in the
      refrigerator it never cleans up until thawed.
      Signed-off-by: Romit Dasgupta <romit@ti.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  9. 04 Nov, 2009 1 commit
  10. 29 Oct, 2009 1 commit
  11. 09 Oct, 2009 1 commit
  12. 16 Sep, 2009 2 commits
  13. 11 Sep, 2009 5 commits
    • writeback: check for registered bdi in flusher add and inode dirty · 500b067c
      Jens Axboe committed
      Also a debugging aid. We want to catch dirty inodes being added to
      backing devices that don't do writeback.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: add name to backing_dev_info · d993831f
      Jens Axboe committed
      This enables us to track who does what and print info. Its main use
      is catching dirty inodes on the default_backing_dev_info, so we can
      fix that up.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: add some debug inode list counters to bdi stats · f09b00d3
      Jens Axboe committed
      Add some debug entries to be able to inspect the internal state of
      the writeback details.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe committed
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT smoother in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi     bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0  74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4    352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0  78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4  78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0  82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0      8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4  12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0   8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0   1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as do random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
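To put a rough number on the smoothness comparison in the commit above, the sketch below computes the squared coefficient of variation (variance divided by squared mean) of the `bo` column for the two vmstat samples quoted in the message. The choice of metric and the helper names are assumptions made for illustration; the commit itself only shows the raw vmstat output.

```c
#include <assert.h>
#include <stddef.h>

/* "bo" (blocks out) values copied from the two vmstat samples above. */
static const double bo_per_bdi[] = {
    71024, 60692, 29540, 54876, 92168, 62924, 58256, 57460, 74748, 53248
};
static const double bo_vanilla[] = {
    74064, 352, 78164, 78156, 82244, 8, 102800, 12328, 106940, 8224,
    113516, 1640
};

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Population variance divided by squared mean; larger means burstier. */
static double squared_cv(const double *v, size_t n)
{
    double m = mean(v, n);
    double var = 0.0;
    for (size_t i = 0; i < n; i++)
        var += (v[i] - m) * (v[i] - m);
    var /= (double)n;
    return var / (m * m);
}
```

Comparing `squared_cv(bo_vanilla, 12)` with `squared_cv(bo_per_bdi, 10)` gives a larger value for the vanilla sample, which matches the description of vanilla fluctuating a lot while the per-bdi run stays smoother.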
    • writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Jens Axboe committed
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  14. 11 Jul, 2009 1 commit
  15. 06 Apr, 2009 1 commit
  16. 26 Mar, 2009 1 commit
  17. 07 Jan, 2009 1 commit
  18. 29 Dec, 2008 1 commit
  19. 11 Dec, 2008 1 commit
  20. 03 Dec, 2008 1 commit
  21. 21 May, 2008 1 commit
    • mm: bdi: fix race in bdi_class device creation · 19051c50
      Greg Kroah-Hartman committed
      There is a race between the time a device is created with device_create()
      and the time its drvdata is set with dev_set_drvdata(): in that window a
      sysfs file could be opened while the drvdata is still NULL, causing all
      sorts of bad things to happen.
      
      This patch fixes the problem by using the new function,
      device_create_vargs().
      
      Many thanks to Arthur Jones <ajones@riverbed.com> for reporting the bug,
      and testing patches out.
      
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Arthur Jones <ajones@riverbed.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  22. 30 Apr, 2008 4 commits
  23. 06 Dec, 2007 1 commit
  24. 17 Oct, 2007 3 commits
    • mm: per device dirty threshold · 04fbfdc1
      Peter Zijlstra committed
      Scale writeback cache per backing device, proportional to its writeout speed.
      
      By decoupling the BDI dirty thresholds a number of problems we currently have
      will go away, namely:
      
       - mutual interference starvation (for any number of BDIs);
       - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).
      
      It might be that all dirty pages are for a single BDI while other BDIs are
      idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
      dirty pages outstanding and make progress.
      
      A global threshold also creates a deadlock for stacked BDIs; when A writes to
      B, and A generates enough dirty pages to get throttled, B will never start
      writeback until the dirty pages go away. Again, by giving each BDI its own
      'independent' dirty limit, this problem is avoided.
      
      So the problem is to determine how to distribute the total dirty limit across
      the BDIs fairly and efficiently. A BDI that has a large dirty limit but does
      not have any dirty pages outstanding is a waste.
      
      What is done is to keep a floating proportion between the BDIs based on
      writeback completions. This way faster/more active devices get a larger share
      than slower/idle devices.
      
      [akpm@linux-foundation.org: fix warnings]
      [hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
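The proportional split described in the commit above can be approximated with simple integer arithmetic. This is a sketch under stated assumptions: `bdi_dirty_share()` is a hypothetical helper, the completion counts are taken as plain totals, and the aging that makes the real proportion "floating" is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Share of total_limit given to device i, proportional to its recent
 * writeback completions. Assumptions: no decay of old completions, and
 * an equal split when no device has completed anything yet.
 */
static unsigned long bdi_dirty_share(unsigned long total_limit,
                                     const unsigned long *completions,
                                     size_t n, size_t i)
{
    unsigned long long sum = 0;

    if (n == 0 || i >= n)
        return 0;
    for (size_t k = 0; k < n; k++)
        sum += completions[k];
    if (sum == 0)
        return total_limit / n;  /* no history: split evenly (assumption) */

    /* Widen before multiplying to avoid overflow on 32-bit longs. */
    return (unsigned long)((unsigned long long)total_limit *
                           completions[i] / sum);
}
```

For example, with a total limit of 1000 pages and completion counts of 300, 100 and 0, the three devices would get 750, 250 and 0 pages, so the faster device receives the larger share and the idle one does not hold on to unused limit.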
    • mm: scalable bdi statistics counters · b2e8fb6e
      Peter Zijlstra committed
      Provide scalable per backing_dev_info statistics counters.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nfs: remove congestion_end() · c4dc4bee
      Peter Zijlstra committed
      These patches aim to improve balance_dirty_pages() and directly address three
      issues:
        1) inter device starvation
        2) stacked device deadlocks
        3) inter process starvation
      
      1 and 2 are a direct result of removing the global dirty limit and using
      per device dirty limits. Because each device gets its own dirty limit, it
      will no longer starve another device, and the cyclic dependency on the dirty
      limit is broken.
      
      In order to efficiently distribute the dirty limit across the independent
      devices, a floating proportion is used; this allocates a share of the total
      limit proportional to the device's recent activity.
      
      3 is done by also scaling the dirty limit proportional to the current task's
      recent dirty rate.
      
      This patch:
      
      nfs: remove congestion_end().  It's redundant, clear_bdi_congested() already
      wakes the waiters.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>