1. 08 Sep, 2014 1 commit
    • T
      percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo committed
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
  2. 18 Apr, 2014 1 commit
  3. 04 Apr, 2014 2 commits
    • J
      bdi: avoid oops on device removal · 5acda9d1
      Jan Kara committed
      After commit 839a8e86 ("writeback: replace custom worker pool
      implementation with unbound workqueue"), when a device is removed while
      we are writing to it, we crash in bdi_writeback_workfn() ->
      set_worker_desc() because bdi->dev is NULL.
      
      This can happen because even though bdi_unregister() cancels all pending
      flushing work, nothing really prevents new ones from being queued from
      balance_dirty_pages() or other places.
      
      Fix the problem by clearing the BDI_registered bit in bdi_unregister()
      and checking it before scheduling any flushing work.
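
The idea of the fix can be pictured with a small single-threaded model. This is only a sketch: the `Bdi` class, its fields and its method names are hypothetical stand-ins rather than the kernel's structures, and the toy ignores the concurrency that the real code has to handle.

```python
class Bdi:
    """Toy stand-in for a backing device; all names here are hypothetical."""

    def __init__(self):
        self.registered = True   # plays the role of the BDI_registered bit
        self.pending = []        # flushing work waiting to run

    def queue_flush(self, work):
        # Check the flag before scheduling any flushing work.
        if not self.registered:
            return False
        self.pending.append(work)
        return True

    def unregister(self):
        # Clear the flag first so nothing new can be queued afterwards,
        # then cancel whatever is still pending.
        self.registered = False
        self.pending.clear()


bdi = Bdi()
bdi.queue_flush("periodic writeback")
bdi.unregister()
# A late request, such as one from a dirty-page throttling path, is refused.
late = bdi.queue_flush("late flush request")
print(late, bdi.pending)
```

Without the flag check in `queue_flush()`, the late request would be queued against a device that is already gone, which is the situation the commit describes.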
      
      Fixes: 839a8e86
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Derek Basehore <dbasehore@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      backing_dev: fix hung task on sync · 6ca738d6
      Derek Basehore committed
      bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
      schedule work to write back dirty inodes.  The problem with this is that
      it can delay work that is scheduled for immediate execution, such as the
      work from sync_inodes_sb().  This can happen since mod_delayed_work()
      can now steal work from a workqueue.  This patch fixes the problem by
      using queue_delayed_work() instead.  This is a regression caused by commit
      839a8e86 ("writeback: replace custom worker pool implementation with
      unbound workqueue").
      
      The reason that this causes a problem is that laptop-mode will change
      the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
      In the case that bdi_wakeup_thread_delayed() races with
      sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
      task.  Even if dirty_writeback_centisecs is not long enough to cause a
      hung task, we still don't want to delay sync for that long.
      
      We fix the problem by using queue_delayed_work() when we want to
      schedule writeback sometime in the future.  This function doesn't change
      the timer if it is already armed.
      
      For the same reason, we also change bdi_writeback_workfn() to
      immediately queue the work again in the case that the work_list is not
      empty.  The same problem can happen if the sync work is run on the
      rescue worker.
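
The difference between the two helpers can be illustrated with a toy model of a single delayed work item. This is a sketch under stated assumptions: the `DelayedWork` class is hypothetical and only mimics the timer semantics described above (one call re-arms the timer unconditionally, the other leaves an armed timer alone); it is not the kernel implementation. The 600 second delay corresponds to the 10 minute laptop-mode value mentioned in the changelog.

```python
class DelayedWork:
    """Toy model of a delayed work item with one timer (hypothetical)."""

    def __init__(self):
        self.expires = None  # time at which the work runs, None when idle

    def queue_delayed_work(self, now, delay):
        # Leaves the timer untouched if the work is already pending.
        if self.expires is not None:
            return False
        self.expires = now + delay
        return True

    def mod_delayed_work(self, now, delay):
        # Re-arms the timer unconditionally, even if the work was due sooner.
        was_pending = self.expires is not None
        self.expires = now + delay
        return was_pending


def run_race(periodic_requeue):
    """Sync queues immediate work, then the periodic wakeup races with it."""
    work = DelayedWork()
    work.queue_delayed_work(now=0, delay=0)      # sync wants to run right now
    periodic_requeue(work, now=0, delay=600)     # delayed wakeup, 600 seconds
    return work.expires


print(run_race(DelayedWork.mod_delayed_work))    # sync work pushed out
print(run_race(DelayedWork.queue_delayed_work))  # sync work still immediate
```

In the first run the immediate work is pushed out by the full delay, which is the hung-task scenario; in the second the already armed timer is left alone.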
      
      [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
      Signed-off-by: Derek Basehore <dbasehore@chromium.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zento.linux.org.uk>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Derek Basehore <dbasehore@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Benson Leung <bleung@chromium.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Cc: Luigi Semenzato <semenzato@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 12 Sep, 2013 1 commit
  5. 20 Aug, 2013 1 commit
  6. 17 Jul, 2013 1 commit
  7. 04 Jul, 2013 1 commit
  8. 02 Apr, 2013 3 commits
    • T
      writeback: expose the bdi_wq workqueue · b5c872dd
      Tejun Heo committed
      There are cases where userland wants to tweak the priority and
      affinity of writeback flushers.  Expose bdi_wq to userland by setting
      WQ_SYSFS.  It appears under /sys/bus/workqueue/devices/writeback/ and
      allows adjusting maximum concurrency level, cpumask and nice level.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • T
      writeback: replace custom worker pool implementation with unbound workqueue · 839a8e86
      Tejun Heo committed
      Writeback implements its own worker pool - each bdi can be associated
      with a worker thread which is created and destroyed dynamically.  The
      worker thread for the default bdi is always present and serves as the
      "forker" thread which forks off worker threads for other bdis.
      
      There's no reason for writeback to implement its own worker pool when
      using an unbound workqueue instead is much simpler and more efficient.
      This patch replaces the custom worker pool implementation in writeback
      with an unbound workqueue.
      
      The conversion isn't too complicated but the following points are worth
      mentioning.
      
      * bdi_writeback->last_active, task and wakeup_timer are removed.
        delayed_work ->dwork is added instead.  Explicit timer handling is
        no longer necessary.  Everything works by either queueing / modding
        / flushing / canceling the delayed_work item.
      
      * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
        bdi_writeback->dwork.  On each execution, it processes
        bdi->work_list and reschedules itself if there are more things to
        do.
      
        The function also handles the low-mem condition, which used to be
        handled by the forker thread.  If the function is running off a
        rescuer thread, it only writes out a limited number of pages so that
        the rescuer can serve other bdis too.  This preserves the flusher
        creation failure behavior of the forker thread.
      
      * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
        bdi_writeback_workfn() about on-going bdi unregistration so that it
        always drains work_list even if it's running off the rescuer.  Note
        that the original code was broken in this regard.  Under memory
        pressure, a bdi could finish unregistration with non-empty
        work_list.
      
      * The default bdi is no longer special.  It now is treated the same as
        any other bdi and bdi_cap_flush_forker() is removed.
      
      * BDI_pending is no longer used.  Removed.
      
      * Some tracepoints become non-applicable.  The following TPs are
        removed - writeback_nothread, writeback_wake_thread,
        writeback_wake_forker_thread, writeback_thread_start,
        writeback_thread_stop.
      
      Everything, including devices coming and going away and rescuer
      operation under simulated memory pressure, seems to work fine in my
      test setup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
    • T
      writeback: remove unused bdi_pending_list · 181387da
      Tejun Heo committed
      There's no user left.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
  9. 22 Feb, 2013 1 commit
    • D
      bdi: allow block devices to say that they require stable page writes · 7d311cda
      Darrick J. Wong committed
      This patchset ("stable page writes, part 2") makes some key
      modifications to the original 'stable page writes' patchset.  First, it
      provides creators (devices and filesystems) of a backing_dev_info a flag
      that declares whether or not it is necessary to ensure that page
      contents cannot change during writeout.  It is no longer assumed that
      this is true of all devices (which was never true anyway).  Second, the
      flag is used to relax the wait_on_page_writeback calls so that waiting
      only occurs if the device needs it.  Third, it fixes up the remaining
      disk-backed filesystems to use this improved conditional-wait logic to
      provide stable page writes on those filesystems.
      
      It is hoped that (for people not using checksumming devices, anyway)
      this patchset will undo the unnecessary performance decreases seen since
      the original stable page write patchset went into 3.0.  Sorry about not
      fixing it sooner.
      
      Complaints were registered by several people about the long write
      latencies introduced by the original stable page write patchset.
      Generally speaking, the kernel ought to allocate as little extra memory
      as possible to facilitate writeout, but for people who simply cannot
      wait, a second page stability strategy is (re)introduced: snapshotting
      page contents.  The waiting behavior is still the default strategy; to
      enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
      set.  This flag is used to bandaid^Henable stable page writeback on
      ext3[1], and is not used anywhere else.
      
      Given that there are already a few storage devices and network FSes that
      have rolled their own page stability wait/page snapshot code, it would
      be nice to move towards consolidating all of these.  It seems possible
      that iscsi and raid5 may wish to use the new stable page write support
      to enable zero-copy writeout.
      
      Thank you to Jan Kara for helping fix a couple more filesystems.
      
      Per Andrew Morton's request, here are the results of using dbench to
      measure latencies on ext2:
      
      3.8.0-rc3:
         Operation      Count    AvgLat    MaxLat
         ----------------------------------------
         WriteX        109347     0.028    59.817
         ReadX         347180     0.004     3.391
         Flush          15514    29.828   287.283
      
        Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
         WriteX        105556     0.029     4.273
         ReadX         335004     0.005     4.112
         Flush          14982    30.540   298.634
      
        Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, for ext2 the maximum write latency decreases from ~60ms
      on a laptop hard disk to ~4ms.  I'm not sure why the flush latencies
      increase, though I suspect that being able to dirty pages faster gives
      the flusher more work to do.
      
      On ext4, the average write latency decreases as well as all the maximum
      latencies:
      
      3.8.0-rc3:
         WriteX         85624     0.152    33.078
         ReadX         272090     0.010    61.210
         Flush          12129    36.219   168.260
      
        Throughput 44.8618 MB/sec  4 clients  4 procs  max_latency=168.276 ms
      
      3.8.0-rc3 + patches:
         WriteX         86082     0.141    30.928
         ReadX         273358     0.010    36.124
         Flush          12214    34.800   165.689
      
        Throughput 44.9941 MB/sec  4 clients  4 procs  max_latency=165.722 ms
      
      XFS seems to exhibit latency improvements similar to ext2:
      
      3.8.0-rc3:
         WriteX        125739     0.028   104.343
         ReadX         399070     0.005     4.115
         Flush          17851    25.004   131.390
      
        Throughput 66.0024 MB/sec  4 clients  4 procs  max_latency=131.406 ms
      
      3.8.0-rc3 + patches:
         WriteX        123529     0.028     6.299
         ReadX         392434     0.005     4.287
         Flush          17549    25.120   188.687
      
        Throughput 64.9113 MB/sec  4 clients  4 procs  max_latency=188.704 ms
      
      ...and btrfs, just to round things out, also shows some latency
      decreases:
      
      3.8.0-rc3:
         WriteX         67122     0.083    82.355
         ReadX         212719     0.005     2.828
         Flush           9547    47.561   147.418
      
        Throughput 35.3391 MB/sec  4 clients  4 procs  max_latency=147.433 ms
      
      3.8.0-rc3 + patches:
         WriteX         64898     0.101    71.631
         ReadX         206673     0.005     7.123
         Flush           9190    47.963   219.034
      
        Throughput 34.0795 MB/sec  4 clients  4 procs  max_latency=219.044 ms
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own wait code, or they don't block at all.  The blocking
      behavior is back to what it was before 3.0 if you don't have a disk
      requiring stable page writes.
      
      This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
      xfs.  I've spot-checked 3.8.0-rc4 and seem to be getting the same
      results as -rc3.
      
      [1] The alternative fixes to ext3 include fixing the locking order and
      page bit handling like we did for ext4 (but then why not just use
      ext4?), or setting PG_writeback so early that ext3 becomes extremely
      slow.  I tried that, but the number of write()s I could initiate dropped
      by nearly an order of magnitude.  That was a bit much even for the
      author of the stable page series! :)
      
      This patch:
      
      Creates a per-backing-device flag that tracks whether or not pages must
      be held immutable during writeout.  Eventually it will be used to waive
      wait_on_page_writeback() if nothing requires stable pages.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 18 Dec, 2012 1 commit
  11. 06 Dec, 2012 1 commit
  12. 25 Aug, 2012 1 commit
  13. 04 Aug, 2012 1 commit
    • A
      vfs: kill write_super and sync_supers · f0cd2dbb
      Artem Bityutskiy committed
      Finally we can kill the 'sync_supers' kernel thread along with the
      '->write_super()' superblock operation because all the users are gone.
      Now every file-system is supposed to manage its own superblock and
      its dirty state.
      
      The nice thing about killing this thread is that it improves power management.
      Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
      every 5 seconds no matter what - even if there were no dirty superblocks and
      even if there were no file-systems using this service (e.g., btrfs and
      journalled ext4 do not need it). So it was wasting power most of the time. And
      because the thread was in the core of the kernel, all systems had to have it.
      So I am quite happy to make it go away.
      
      Interestingly, this thread is a left-over from the pdflush kernel thread which
      was a self-forking kernel thread responsible for all the write-back in old
      Linux kernels. It was turned into per-block device BDI threads, and
      'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  14. 01 Aug, 2012 1 commit
  15. 09 Jun, 2012 1 commit
  16. 01 Feb, 2012 1 commit
  17. 22 Nov, 2011 1 commit
    • T
      freezer: implement and use kthread_freezable_should_stop() · 8a32c441
      Tejun Heo committed
      Writeback and thinkpad_acpi have been using thaw_process() to prevent
      deadlock between the freezer and kthread_stop(); unfortunately, this
      is inherently racy - nothing prevents freezing from happening between
      thaw_process() and kthread_stop().
      
      This patch implements kthread_freezable_should_stop() which enters
      the refrigerator if necessary but is guaranteed to return if
      kthread_stop() is invoked.  Both thaw_process() users are converted to
      use the new function.
      
      Note that this deadlock condition exists for many freezable
      kthreads.  They need to be converted to use the new should_stop or
      a freezable workqueue.
      
      Tested with synthetic test case.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
  18. 11 Nov, 2011 1 commit
    • R
      backing-dev: ensure wakeup_timer is deleted · 7a401a97
      Rabin Vincent committed
      bdi_prune_sb() in bdi_unregister() attempts to remove the bdi links
      from all super_blocks and then del_timer_sync() the writeback timer.
      
      However, this can race with __mark_inode_dirty(), leading to
      bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
      we're unregistering, after we've called del_timer_sync().
      
      This can end up with the bdi being freed with an active timer inside it,
      as in the case of the following dump after the removal of an SD card.
      
      Fix this by redoing the del_timer_sync() in bdi_destroy().
      
       ------------[ cut here ]------------
       WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
       ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
       Modules linked in:
       Backtrace:
       [<c00109dc>] (dump_backtrace+0x0/0x110) from [<c0236e4c>] (dump_stack+0x18/0x1c)
        r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
       [<c0236e34>] (dump_stack+0x0/0x1c) from [<c0025e6c>] (warn_slowpath_common+0x54/0x6c)
       [<c0025e18>] (warn_slowpath_common+0x0/0x6c) from [<c0025f28>] (warn_slowpath_fmt+0x38/0x40)
        r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
       r3:00000009
       [<c0025ef0>] (warn_slowpath_fmt+0x0/0x40) from [<c015eb4c>] (debug_print_object+0x9c/0xc8)
        r3:c02b1b29 r2:c02bc662
       [<c015eab0>] (debug_print_object+0x0/0xc8) from [<c015f574>] (debug_check_no_obj_freed+0xac/0x1dc)
        r6:c7964000 r5:00000001 r4:c7964000
       [<c015f4c8>] (debug_check_no_obj_freed+0x0/0x1dc) from [<c00a9e38>] (kmem_cache_free+0x88/0x1f8)
       [<c00a9db0>] (kmem_cache_free+0x0/0x1f8) from [<c014286c>] (blk_release_queue+0x70/0x78)
       [<c01427fc>] (blk_release_queue+0x0/0x78) from [<c015290c>] (kobject_release+0x70/0x84)
        r5:c79641f0 r4:c796420c
       [<c015289c>] (kobject_release+0x0/0x84) from [<c0153ce4>] (kref_put+0x68/0x80)
        r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
       [<c0153c7c>] (kref_put+0x0/0x80) from [<c01527d0>] (kobject_put+0x48/0x5c)
        r5:c79643b4 r4:c79641f0
       [<c0152788>] (kobject_put+0x0/0x5c) from [<c013ddd8>] (blk_cleanup_queue+0x68/0x74)
        r4:c7964000
       [<c013dd70>] (blk_cleanup_queue+0x0/0x74) from [<c01a6370>] (mmc_blk_put+0x78/0xe8)
        r5:00000000 r4:c794c400
       [<c01a62f8>] (mmc_blk_put+0x0/0xe8) from [<c01a64b4>] (mmc_blk_release+0x24/0x38)
        r5:c794c400 r4:c0322824
       [<c01a6490>] (mmc_blk_release+0x0/0x38) from [<c00de11c>] (__blkdev_put+0xe8/0x170)
        r5:c78d5e00 r4:c74083c0
       [<c00de034>] (__blkdev_put+0x0/0x170) from [<c00de2c0>] (blkdev_put+0x11c/0x12c)
        r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
       r3:00000000
       [<c00de1a4>] (blkdev_put+0x0/0x12c) from [<c00b0724>] (kill_block_super+0x60/0x6c)
        r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
       [<c00b06c4>] (kill_block_super+0x0/0x6c) from [<c00b0a94>] (deactivate_locked_super+0x44/0x70)
        r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
       [<c00b0a50>] (deactivate_locked_super+0x0/0x70) from [<c00b1358>] (deactivate_super+0x6c/0x70)
        r5:c794dc00 r4:c794dc00
       [<c00b12ec>] (deactivate_super+0x0/0x70) from [<c00c88b0>] (mntput_no_expire+0x188/0x194)
        r5:c794dc00 r4:c7942300
       [<c00c8728>] (mntput_no_expire+0x0/0x194) from [<c00c95e0>] (sys_umount+0x2e4/0x310)
        r6:c7942300 r5:00000000 r4:00000000 r3:00000000
       [<c00c92fc>] (sys_umount+0x0/0x310) from [<c000d940>] (ret_fast_syscall+0x0/0x30)
       ---[ end trace e5c83c92ada51c76 ]---
      
      Cc: stable@kernel.org
      Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. 01 Nov, 2011 1 commit
  20. 31 Oct, 2011 1 commit
    • C
      writeback: Add a 'reason' to wb_writeback_work · 0e175a18
      Curt Wohlgemuth committed
      This creates a new 'reason' field in a wb_writeback_work
      structure, which unambiguously identifies who initiates
      writeback activity.  A 'wb_reason' enumeration has been
      added to writeback.h, to enumerate the possible reasons.
      
      The 'writeback_work_class' tracepoint event class and the
      'writeback_queue_io' tracepoints are updated to include the
      symbolic 'reason' in all trace events.
      
      And the 'writeback_inodes_sbXXX' family of routines has had
      a wb_reason parameter added to them, so callers can specify
      why writeback is being started.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  21. 03 Oct, 2011 3 commits
    • W
      writeback: stabilize bdi->dirty_ratelimit · 7381131c
      Wu Fengguang committed
      There are some imperfections in balanced_dirty_ratelimit.
      
      1) large fluctuations
      
      The dirty_rate used for computing balanced_dirty_ratelimit is merely
      averaged over the past 200ms (very short compared to the 3s estimation
      period for write_bw), which makes the distribution of
      balanced_dirty_ratelimit rather dispersed.
      
      It's pretty hard to average out the singular points by increasing the
      estimation period. Considering that the averaging technique will
      introduce very undesirable time lags, I give it up totally. (btw, the 3s
      write_bw averaging time lag is much more acceptable because its impact
      is one-way and therefore won't lead to oscillations.)
      
      The more practical way is filtering -- most singular
      balanced_dirty_ratelimit points can be filtered out by remembering some
      prev_balanced_rate and prev_prev_balanced_rate. However the more
      reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.
      
      2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
      match could become unbalanced, which may lead to large systematical
      errors in balanced_dirty_ratelimit. The truncates, due to its possibly
      bumpy nature, can hardly be compensated smoothly. So let's face it. When
      some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
      high, dirty pages will go higher than the setpoint. task_ratelimit will
      in turn become lower than dirty_ratelimit.  So if we consider both
      balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
      only when they are on the same side of dirty_ratelimit, the systematical
      errors in balanced_dirty_ratelimit won't be able to bring
      dirty_ratelimit far away.
      
      The balanced_dirty_ratelimit estimation may also be inaccurate near
      @limit or @freerun, but that is less of an issue.
      
      3) since we ultimately want to
      
      - keep the fluctuations of task ratelimit as small as possible
      - keep the dirty pages around the setpoint for as long as possible
      
      the update policy used for (2) also serves the above goals nicely:
      if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
      and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
      there is no point in bringing up dirty_ratelimit in a hurry only to hurt
      both of the above goals.
      
      So, we make use of task_ratelimit to limit the update of dirty_ratelimit
      in two ways:
      
      1) avoid changing dirty rate when it's against the position control target
         (the adjusted rate will slow down the progress of dirty pages going
         back to setpoint).
      
      2) limit the step size. task_ratelimit changes value step by step,
         leaving a consistent trace compared to the randomly jumping
         balanced_dirty_ratelimit. task_ratelimit also has the nice property
         of smaller errors in the stable state and typically larger errors when
         there are big errors in rate.  So it's a pretty good limiting factor
         for the step size of dirty_ratelimit.
      
      Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
      task_ratelimit is merely used as a limiting factor.
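
The update policy described above can be sketched as a small function. This is a simplified reading of the changelog, not the kernel code: the function name `next_dirty_ratelimit` is hypothetical, and any additional smoothing the real implementation applies is left out. The rate moves only when balanced_dirty_ratelimit and task_ratelimit lie on the same side of dirty_ratelimit, and the step is capped by whichever of the two is nearer.

```python
def next_dirty_ratelimit(dirty_ratelimit, balanced_dirty_ratelimit, task_ratelimit):
    """Return the updated rate limit under the 'same side' policy (sketch)."""
    both_above = (balanced_dirty_ratelimit > dirty_ratelimit
                  and task_ratelimit > dirty_ratelimit)
    both_below = (balanced_dirty_ratelimit < dirty_ratelimit
                  and task_ratelimit < dirty_ratelimit)
    if both_above:
        # Move up, but no further than the nearer of the two targets.
        return min(balanced_dirty_ratelimit, task_ratelimit)
    if both_below:
        # Move down, again limited by the nearer target.
        return max(balanced_dirty_ratelimit, task_ratelimit)
    # The two estimates disagree: leave the rate alone.
    return dirty_ratelimit


# Illustrative values in pages per second.
print(next_dirty_ratelimit(100, 150, 110))  # both above: step limited to 110
print(next_dirty_ratelimit(100, 150, 90))   # disagreement: stays at 100
print(next_dirty_ratelimit(100, 60, 80))    # both below: step limited to 80
```

The second call is the case spelled out in the text: dirty pages are high (task_ratelimit < dirty_ratelimit) while balanced_dirty_ratelimit suggests raising the rate, so the rate is left unchanged.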
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • W
      writeback: dirty rate control · be3ffa27
      Wu Fengguang committed
      It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
      when there are N dd tasks.
      
      On write() syscall, use bdi->dirty_ratelimit
      ============================================
      
          balance_dirty_pages(pages_dirtied)
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              pause = pages_dirtied / task_ratelimit;
              sleep(pause);
          }
      
      On every 200ms, update bdi->dirty_ratelimit
      ===========================================
      
          bdi_update_dirty_ratelimit()
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
              bdi->dirty_ratelimit = balanced_dirty_ratelimit;
          }
      
      Estimation of balanced bdi->dirty_ratelimit
      ===========================================
      
      balanced task_ratelimit
      -----------------------
      
      balance_dirty_pages() needs to throttle tasks dirtying pages such that
      the total amount of dirty pages stays below the specified dirty limit in
      order to avoid memory deadlocks. Furthermore we desire fairness in that
      tasks get throttled proportionally to the amount of pages they dirty.
      
      IOW we want to throttle tasks such that we match the dirty rate to the
      writeout bandwidth, this yields a stable amount of dirty pages:
      
              dirty_rate == write_bw                                          (1)
      
      The fairness requirement gives us:
      
              task_ratelimit = balanced_dirty_ratelimit
                             == write_bw / N                                  (2)
      
      where N is the number of dd tasks.  We don't know N beforehand, but
      still can estimate balanced_dirty_ratelimit within 200ms.
      
      Start by throttling each dd task at rate
      
              task_ratelimit = task_ratelimit_0                               (3)
                               (any non-zero initial value is OK)
      
      After 200ms, we measured
      
              dirty_rate = # of pages dirtied by all dd's / 200ms
              write_bw   = # of pages written to the disk / 200ms
      
      For the aggressive dd dirtiers, the equality holds
      
              dirty_rate == N * task_rate
                         == N * task_ratelimit_0                              (4)
      Or
              task_ratelimit_0 == dirty_rate / N                              (5)
      
      Now we conclude that the balanced task ratelimit can be estimated by
      
                                                            write_bw
              balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                            dirty_rate
      
      Because with (4) and (5) we can get the desired equality (1):
      
                                                             write_bw
              balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                             dirty_rate
                                       == write_bw / N
      
      Then using the balanced task ratelimit we can compute task pause times like:
      
              task_pause = task->nr_dirtied / task_ratelimit
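
A quick numeric check of equations (4) and (6) may help. The sketch below uses made-up numbers (4 dd tasks, a disk writing 100 pages per second) and hypothetical function names; it only replays the arithmetic of the derivation and is not kernel code.

```python
def balanced_ratelimit(task_ratelimit_0, n_tasks, write_bw):
    """Apply equations (4) and (6) for n_tasks aggressive dirtiers."""
    # (4): every task dirties pages at exactly its trial limit.
    dirty_rate = n_tasks * task_ratelimit_0
    # (6): scale the trial limit by the measured write_bw / dirty_rate.
    return task_ratelimit_0 * write_bw / dirty_rate


def task_pause(nr_dirtied, task_ratelimit):
    """Pause in seconds for a task that dirtied nr_dirtied pages."""
    return nr_dirtied / task_ratelimit


# Illustrative numbers only: 4 dd tasks, the disk writes 100 pages/s.
from_low_guess = balanced_ratelimit(task_ratelimit_0=10, n_tasks=4, write_bw=100)
from_high_guess = balanced_ratelimit(task_ratelimit_0=80, n_tasks=4, write_bw=100)
pause = task_pause(nr_dirtied=50, task_ratelimit=from_low_guess)
print(from_low_guess, from_high_guess, pause)
```

Both starting guesses land on write_bw / N = 25 pages per second, which is why any non-zero initial value is acceptable in (3).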
      
      task_ratelimit with position control
      ------------------------------------
      
      However, while the above gives us means of matching the dirty rate to
      the writeout bandwidth, it at best provides us with a stable dirty page
      count (assuming a static system). In order to control the dirty page
      count such that it is high enough to provide performance, but does not
      exceed the specified limit we need another control.
      
      The dirty position control works by extending (2) to
      
              task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)
      
      where pos_ratio is a negative feedback function that is subject to
      
      1) f(setpoint) = 1.0
      2) df/dx < 0
      
      That is, if the dirty pages are ABOVE the setpoint, we throttle each
      task a bit more HEAVILY than balanced_dirty_ratelimit, so that the dirty
      pages are created less fast than they are cleaned, thus DROPPING to the
      setpoint (and the reverse).
      
      Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
      remain CONSTANT for the past 200ms, we get
      
              task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)
      
      Putting (8) into (6), we get the formula used in
      bdi_update_dirty_ratelimit():
      
                                                      write_bw
              balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                                      dirty_rate
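
      The update in (9) can be sketched as a single step in C. This is a
      floating-point illustration under stated assumptions, not the code in
      bdi_update_dirty_ratelimit(); the function name and the use of double
      are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical one-step version of formula (9):
 *   balanced_dirty_ratelimit *= pos_ratio * write_bw / dirty_rate
 * All rates are in pages per second.
 */
static double update_balanced_ratelimit(double balanced_dirty_ratelimit,
                                        double pos_ratio,
                                        double write_bw,
                                        double dirty_rate)
{
    assert(dirty_rate > 0.0);   /* nothing to balance if nobody dirties */
    return balanced_dirty_ratelimit * pos_ratio * write_bw / dirty_rate;
}
```

      As a worked case, take N = 4 tasks, write_bw = 100 and a current
      balanced_dirty_ratelimit of 50. With pos_ratio = 1.0 each task dirties
      at 50, so dirty_rate = 200 and the update yields 50 * 1.0 * 100 / 200
      = 25 = write_bw / N. With pos_ratio = 0.5 each task dirties at 25, so
      dirty_rate = 100 and the update yields 50 * 0.5 * 100 / 100 = 25 again.
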
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      be3ffa27
    • W
      writeback: account per-bdi accumulated dirtied pages · c8e28ce0
      Committed by Wu Fengguang
      Introduce the BDI_DIRTIED counter. It will be used for estimating the
      bdi's dirty bandwidth.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Michael Rubin <mrubin@google.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      c8e28ce0
  22. 03 Sep, 2011 2 commits
  23. 26 Jul, 2011 1 commit
  24. 24 Jul, 2011 1 commit
  25. 10 Jul, 2011 4 commits
    • W
      writeback: show bdi write bandwidth in debugfs · 00821b00
      Committed by Wu Fengguang
      Add a "BdiWriteBandwidth" entry and indent others in /debug/bdi/*/stats.
      
      Also increase the numeric field width to 10, to keep the possibly
      huge BdiWritten number aligned, at least on desktop systems.
      
      Impact: this could break user space tools if they are dumb enough to
      depend on the number of white spaces.
      
      CC: Theodore Ts'o <tytso@mit.edu>
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      00821b00
    • W
      writeback: bdi write bandwidth estimation · e98be2d5
      Committed by Wu Fengguang
      The estimated value starts at 100MB/s and adapts to the real
      bandwidth within seconds.
      
      It tries to update the bandwidth only when the disk is fully utilized.
      Any inactive period of more than one second will be skipped.
      
      The estimated bandwidth reflects how fast the device can write out
      when _fully utilized_, and won't drop to 0 when it goes idle. The
      value remains constant while the disk is idle. Under busy writes,
      fluctuations aside, it also remains high unless it is knocked down by
      concurrent reads that compete with async writes for disk time and
      bandwidth.
      
      The estimation is not done purely in the flusher because there is no
      guarantee for write_cache_pages() to return timely to update bandwidth.
      
      The bdi->avg_write_bandwidth smoothing is very effective at filtering
      out sudden spikes, but may be slightly biased in the long term.
      
      The overheads are low because the bdi bandwidth update only occurs at
      200ms intervals.
      
      The 200ms update interval is suitable because, due to large
      fluctuations, it is not possible to get the real instantaneous
      bandwidth anyway.
      
      NFS commits can be as large as seconds' worth of data. One XFS
      completion may be as large as half a second's worth of data if we
      increase the write chunk to half a second's worth of data. In ext4,
      fluctuations with a period of around 5 seconds are observed. And there
      is another pattern of irregular periods of up to 20 seconds in SSD tests.
      
      That's why we not only do the estimation at 200ms intervals, but
      also average the estimates over a period of 3 seconds and then apply
      another level of smoothing in avg_write_bandwidth.
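
      The two smoothing levels described above can be sketched roughly as
      follows. This is a simplified illustration, not the code from this
      patch: struct bw_est, bw_update(), the millisecond units and the 1/8
      step used for the second-level smoothing are assumptions. The 3 second
      averaging period and the skipping of inactive periods longer than one
      second follow the description above.

```c
#include <assert.h>

#define BW_PERIOD_MS   3000UL  /* averaging period described above */
#define BW_MAX_IDLE_MS 1000UL  /* longer gaps are treated as inactivity */

/* Hypothetical container for the two estimates, in pages per second. */
struct bw_est {
    unsigned long write_bw;      /* averaged over roughly 3 seconds */
    unsigned long avg_write_bw;  /* second level of smoothing */
};

/*
 * Fold one sample into the estimate: `written` pages completed during
 * the last `elapsed_ms` milliseconds (nominally about 200 ms).
 */
static void bw_update(struct bw_est *e, unsigned long written,
                      unsigned long elapsed_ms)
{
    unsigned long inst;

    assert(elapsed_ms > 0);
    if (elapsed_ms > BW_MAX_IDLE_MS)
        return;                  /* skip inactive periods */

    /* Bandwidth seen in this interval alone. */
    inst = written * 1000UL / elapsed_ms;

    /* Level 1: weight the new sample by its share of the 3 s period. */
    e->write_bw = (inst * elapsed_ms +
                   e->write_bw * (BW_PERIOD_MS - elapsed_ms)) / BW_PERIOD_MS;

    /* Level 2 (assumed form): move 1/8 of the way toward write_bw. */
    if (e->avg_write_bw > e->write_bw)
        e->avg_write_bw -= (e->avg_write_bw - e->write_bw) / 8;
    else
        e->avg_write_bw += (e->write_bw - e->avg_write_bw) / 8;
}
```

      Starting from 25600 pages/s (100MB/s with 4KB pages, an assumed page
      size), a single 200 ms sample of 1000 pages moves write_bw only to
      24226 and avg_write_bw to 25429 in this sketch, which shows how a
      short dip is damped rather than followed immediately.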
      
      CC: Li Shaohua <shaohua.li@intel.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      e98be2d5
    • J
      writeback: account per-bdi accumulated written pages · f7d2b1ec
      Committed by Jan Kara
      Introduce the BDI_WRITTEN counter. It will be used for estimating the
      bdi's write bandwidth.
      
      Peter Zijlstra <a.p.zijlstra@chello.nl>:
      Move BDI_WRITTEN accounting into __bdi_writeout_inc().
      This will cover and fix fuse, which only calls bdi_writeout_inc().
      
      CC: Michael Rubin <mrubin@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      f7d2b1ec
    • W
      writeback: make writeback_control.nr_to_write straight · d46db3d5
      Committed by Wu Fengguang
      Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
      and initialize the struct writeback_control there.
      
      struct writeback_control is basically designed to control writeback of a
      single file, but we keep abusing it for writing multiple files in
      writeback_sb_inodes() and its callers.
      
      It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
      work->nr_pages starts to make sense, and instead of saving and restoring
      pages_skipped in writeback_sb_inodes() it can always start with a clean
      zero value.
      
      It also makes a neat IO pattern change: large dirty files are now
      written in the full 4MB writeback chunk size, rather than whatever
      quota remained in wbc->nr_to_write.
      Acked-by: Jan Kara <jack@suse.cz>
      Proposed-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      d46db3d5
  26. 08 Jun, 2011 1 commit
    • C
      writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Committed by Christoph Hellwig
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contention to about 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      f758eeab
  27. 21 May, 2011 1 commit
  28. 31 Mar, 2011 1 commit
  29. 25 Mar, 2011 1 commit
  30. 17 Mar, 2011 1 commit
  31. 10 Mar, 2011 1 commit