1. 30 Oct 2010: 1 commit
    • fs-writeback.c: unify some common code · cdf01dd5
      Committed by Linus Torvalds
      The btrfs merge looks like hell, because it changes fs-writeback.c, and
      the crazy code has this repeated "estimate number of dirty pages"
      counting that involves three different helper functions.  And it's done
      in two different places.
      
      Just unify that whole calculation as a "get_nr_dirty_pages()" helper
      function, and the merge result will look half-way decent.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
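A minimal userspace sketch of the refactoring: both call sites used to read three separate counts, and now each makes one call to a shared helper. The struct, its field names, and the assumption that the estimate is a plain sum of dirty page-cache pages, unstable NFS pages and dirty inodes are illustrative only. This is not the kernel's get_nr_dirty_pages().

```c
#include <stdbool.h>

/* Hypothetical snapshot of the three counts the old code read
 * through three different helper functions. */
struct dirty_counters {
    long file_dirty_pages;   /* dirty page-cache pages */
    long unstable_nfs_pages; /* written to an NFS server, not yet committed */
    long dirty_inodes;       /* dirty inodes, counted as extra work */
};

/* The single helper that replaces the repeated open-coded estimate. */
long get_nr_dirty_pages_sketch(const struct dirty_counters *c)
{
    return c->file_dirty_pages + c->unstable_nfs_pages + c->dirty_inodes;
}

/* Call site 1 (hypothetical): how many pages a sync should ask for. */
long sync_pages_to_write(const struct dirty_counters *c)
{
    return get_nr_dirty_pages_sketch(c);
}

/* Call site 2 (hypothetical): whether periodic writeback has work to do. */
bool periodic_flush_needed(const struct dirty_counters *c)
{
    return get_nr_dirty_pages_sketch(c) > 0;
}
```

With the calculation in one place, a change to the estimate only has to be made once.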
  2. 27 Oct 2010: 3 commits
  3. 26 Oct 2010: 6 commits
  4. 04 Oct 2010: 1 commit
    • writeback: always use sb->s_bdi for writeback purposes · aaead25b
      Committed by Christoph Hellwig
      We currently use struct backing_dev_info for various different purposes.
      Originally it was introduced to describe a backing device which includes
      an unplug and congestion function and various bits of readahead information
      and VM-relevant flags.  We're also using it for tracking dirty inodes for
      writeback.
      
      To make writeback properly find all inodes we need to only access the
      per-filesystem backing_dev_info pointed to by the superblock in ->s_bdi
      inside the writeback code, and not the instances pointed to by
      inode->i_mapping->backing_dev_info, which can be overridden by special
      devices or might not be set at all by some filesystems.
      
      Long term we should split out the writeback-relevant bits of struct
      backing_dev_info (which includes more than the current bdi_writeback)
      and only point to it from the superblock while leaving the traditional
      backing device as a separate structure that can be overridden by devices.
      
      The one exception for now is the block device filesystem which really
      wants different writeback contexts for its different (internal) inodes
      to handle the writeout more efficiently.  For now we do this with
      a hack in fs-writeback.c because we're so late in the cycle, but in
      the future I plan to replace this with a superblock method that allows
      for multiple writeback contexts per filesystem.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
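The rule described above, including the temporary block-device exception, can be modelled with a few simplified structures. This is a hedged userspace sketch. The structs are cut-down stand-ins that only borrow the field names mentioned in the message, and `inode_to_bdi_sketch()` is a hypothetical name. Identifying the block device filesystem by its type name "bdev" is our assumption about how such a hack might work.

```c
#include <string.h>

/* Cut-down stand-ins for the kernel structures named in the message. */
struct backing_dev_info { const char *name; };
struct file_system_type { const char *name; };
struct super_block {
    struct file_system_type *s_type;
    struct backing_dev_info *s_bdi;      /* per-filesystem writeback bdi */
};
struct address_space { struct backing_dev_info *backing_dev_info; };
struct inode {
    struct super_block *i_sb;
    struct address_space *i_mapping;
};

/* Pick the bdi that writeback should use for this inode. */
struct backing_dev_info *inode_to_bdi_sketch(const struct inode *inode)
{
    const struct super_block *sb = inode->i_sb;

    /* Exception: block device inodes keep per-device writeback contexts. */
    if (strcmp(sb->s_type->name, "bdev") == 0)
        return inode->i_mapping->backing_dev_info;

    /* Everyone else: the superblock's bdi, never the mapping's. */
    return sb->s_bdi;
}
```

A device node on ext3 whose mapping points at some special bdi is still written back through the ext3 superblock's bdi.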
  5. 22 9月, 2010 1 次提交
    • bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friends · 692ebd17
      Committed by Jan Kara
      Inodes of devices such as /dev/zero can get dirty for example via
      utime(2) syscall or due to atime update. Backing device of such inodes
      (zero_bdi, etc.) is however unable to handle dirty inodes and thus
      __mark_inode_dirty complains.  In fact, the inode should rather be dirtied
      against the backing device of the filesystem holding it. This is generally a
      good rule except for filesystems such as 'bdev' or 'mtd_inodefs'. Inodes
      in these pseudofilesystems are referenced from ordinary filesystem
      inodes and carry mapping with real data of the device. Thus for these
      inodes we have to use inode->i_mapping->backing_dev_info as we have done so
      far. We distinguish these filesystems by checking whether sb->s_bdi
      points to a non-trivial backing device or not.
      
      Example: Assume we have an ext3 filesystem on /dev/sda1 mounted on /.
      There's a device inode A described by a path "/dev/sdb" on this
      filesystem. This inode will be dirtied against backing device "8:0"
      after this patch. bdev filesystem contains block device inode B coupled
      with our inode A. When someone modifies a page of /dev/sdb, it's B that
      gets dirtied and the dirtying happens against the backing device "8:16".
      Thus both inodes get filed to a correct bdi list.
      
      Cc: stable@kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
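The selection rule and the /dev/sdb example can be sketched as follows. Everything here is hypothetical. `noop_bdi_sketch` stands in for whatever "trivial" backing device the pseudo-filesystems point sb->s_bdi at, and `bdi_for_dirtying()` is an illustrative name, not the patched kernel function.

```c
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures named in the message. */
struct backing_dev_info { const char *name; };
struct super_block { struct backing_dev_info *s_bdi; };
struct address_space { struct backing_dev_info *backing_dev_info; };
struct inode {
    struct super_block *i_sb;
    struct address_space *i_mapping;
};

/* Hypothetical "trivial" bdi that pseudo-filesystems such as bdev use. */
struct backing_dev_info noop_bdi_sketch = { "noop" };

struct backing_dev_info *bdi_for_dirtying(const struct inode *inode)
{
    struct backing_dev_info *sb_bdi = inode->i_sb->s_bdi;

    /* Ordinary filesystem: dirty against the filesystem's own device. */
    if (sb_bdi != NULL && sb_bdi != &noop_bdi_sketch)
        return sb_bdi;

    /* Pseudo-filesystem (bdev, mtd_inodefs): keep using the mapping's bdi. */
    return inode->i_mapping->backing_dev_info;
}
```

In the example above, inode A and the /dev/zero inode both live on ext3 and resolve to "8:0". Inode B lives in the bdev pseudo-filesystem and resolves to "8:16".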
  6. 28 Aug 2010: 1 commit
    • writeback: Fix lost wake-up shutting down writeback thread · b76b4014
      Committed by J. Bruce Fields
      Setting the task state here may cause us to miss the wake up from
      kthread_stop(), so we need to recheck kthread_should_stop() or risk
      sleeping forever in the following schedule().
      
      Symptom was an indefinite hang on an NFSv4 mount.  (NFSv4 may create
      multiple mounts in a temporary namespace while traversing the mount
      path, and since the temporary namespace is immediately destroyed, it may
      end up destroying a mount very soon after it was created, possibly
      making this race more likely.)
      
      INFO: task mount.nfs4:4314 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.nfs4    D 0000000000000000  2880  4314   4313 0x00000000
       ffff88001ed6da28 0000000000000046 ffff88001ed6dfd8 ffff88001ed6dfd8
       ffff88001ed6c000 ffff88001ed6c000 ffff88001ed6c000 ffff88001e5003a0
       ffff88001ed6dfd8 ffff88001e5003a8 ffff88001ed6c000 ffff88001ed6dfd8
      Call Trace:
       [<ffffffff8196090d>] schedule_timeout+0x1cd/0x2e0
       [<ffffffff8106a31c>] ? mark_held_locks+0x6c/0xa0
       [<ffffffff819639a0>] ? _raw_spin_unlock_irq+0x30/0x60
       [<ffffffff8106a5fd>] ? trace_hardirqs_on_caller+0x14d/0x190
       [<ffffffff819671fe>] ? sub_preempt_count+0xe/0xd0
       [<ffffffff8195fc80>] wait_for_common+0x120/0x190
       [<ffffffff81033c70>] ? default_wake_function+0x0/0x20
       [<ffffffff8195fdcd>] wait_for_completion+0x1d/0x20
       [<ffffffff810595fa>] kthread_stop+0x4a/0x150
       [<ffffffff81061a60>] ? thaw_process+0x70/0x80
       [<ffffffff810cc68a>] bdi_unregister+0x10a/0x1a0
       [<ffffffff81229dc9>] nfs_put_super+0x19/0x20
       [<ffffffff810ee8c4>] generic_shutdown_super+0x54/0xe0
       [<ffffffff810ee9b6>] kill_anon_super+0x16/0x60
       [<ffffffff8122d3b9>] nfs4_kill_super+0x39/0x90
       [<ffffffff810eda45>] deactivate_locked_super+0x45/0x60
       [<ffffffff810edfb9>] deactivate_super+0x49/0x70
       [<ffffffff81108294>] mntput_no_expire+0x84/0xe0
       [<ffffffff811084ef>] release_mounts+0x9f/0xc0
       [<ffffffff81108575>] put_mnt_ns+0x65/0x80
       [<ffffffff8122cc56>] nfs_follow_remote_path+0x1e6/0x420
       [<ffffffff8122cfbf>] nfs4_try_mount+0x6f/0xd0
       [<ffffffff8122d0c2>] nfs4_get_sb+0xa2/0x360
       [<ffffffff810edcb8>] vfs_kern_mount+0x88/0x1f0
       [<ffffffff810ede92>] do_kern_mount+0x52/0x130
       [<ffffffff81963d9a>] ? _lock_kernel+0x6a/0x170
       [<ffffffff81108e9e>] do_mount+0x26e/0x7f0
       [<ffffffff81106b3a>] ? copy_mount_options+0xea/0x190
       [<ffffffff811094b8>] sys_mount+0x98/0xf0
       [<ffffffff810024d8>] system_call_fastpath+0x16/0x1b
      1 lock held by mount.nfs4/4314:
       #0:  (&type->s_umount_key#24){+.+...}, at: [<ffffffff810edfb1>] deactivate_super+0x41/0x70
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
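The kernel fix re-checks kthread_should_stop() after setting the task state and before calling schedule(). A userspace analogy with POSIX threads follows the same rule: test the stop predicate under the lock immediately before waiting, so a stop request that arrives before the sleep is never missed. This is an illustration, not the patched kernel loop. `struct worker`, `worker_fn()` and `worker_stop()` are hypothetical names.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a kernel thread that sleeps until told to stop. */
struct worker {
    pthread_mutex_t lock;
    pthread_cond_t wake;
    bool should_stop;   /* plays the role of kthread_should_stop() */
    int loops;          /* number of passes through the loop */
};

void worker_init(struct worker *w)
{
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    w->should_stop = false;
    w->loops = 0;
}

void *worker_fn(void *arg)
{
    struct worker *w = arg;

    pthread_mutex_lock(&w->lock);
    for (;;) {
        w->loops++;
        /* Re-check the stop condition right before sleeping, with the
         * lock held: a stop request made earlier is seen here instead
         * of being lost, so we never sleep forever. */
        if (w->should_stop)
            break;
        pthread_cond_wait(&w->wake, &w->lock);
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Analogue of kthread_stop(): set the flag, wake the thread, wait for it. */
void worker_stop(struct worker *w, pthread_t tid)
{
    pthread_mutex_lock(&w->lock);
    w->should_stop = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
    pthread_join(tid, NULL);
}
```

Calling worker_stop() immediately after pthread_create() is the userspace analogue of the NFSv4 case above, where a mount is torn down very soon after it was created. Because of the re-check, the join still returns.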
  7. 12 Aug 2010: 5 commits
  8. 10 Aug 2010: 2 commits
    • mm: avoid resetting wb_start after each writeback round · 7624ee72
      Committed by Jan Kara
      WB_SYNC_NONE writeback is done in rounds of 1024 pages so that we don't
      write out some huge inode for too long while starving writeout of other
      inodes.  To avoid livelocks, we record the time we started writeback in
      wbc->wb_start and do not write out inodes which were dirtied after this
      time.  But currently, writeback_inodes_wb() resets wb_start each time it
      is called thus effectively invalidating this logic and making any
      WB_SYNC_NONE writeback prone to livelocks.
      
      This patch makes sure wb_start is set only once when we start writeback.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
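A small model of the fix: writeback runs in rounds of at most 1024 pages. The cutoff time is sampled once before the first round, so any inode dirtied after that cutoff is skipped in every round. `struct sketch_inode`, `writeback_round()` and `writeback_all()` are hypothetical, and real writeback tracks far more state.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified inode: when it was first dirtied and
 * how many dirty pages it still has. */
struct sketch_inode {
    unsigned long dirtied_when;
    long dirty_pages;
};

/* One round: write at most `chunk` pages in total, skipping any inode
 * dirtied after wb_start so newly dirtied data cannot keep us busy forever. */
long writeback_round(struct sketch_inode *inodes, size_t n,
                     unsigned long wb_start, long chunk)
{
    long written = 0;

    for (size_t i = 0; i < n && written < chunk; i++) {
        if (inodes[i].dirtied_when > wb_start)
            continue;
        long todo = inodes[i].dirty_pages;
        if (todo > chunk - written)
            todo = chunk - written;
        inodes[i].dirty_pages -= todo;
        written += todo;
    }
    return written;
}

/* Repeat rounds until nothing eligible is left. The fix: wb_start is
 * sampled once here, not reset at the start of every round. */
long writeback_all(struct sketch_inode *inodes, size_t n,
                   unsigned long now, long chunk)
{
    const unsigned long wb_start = now;
    long total = 0, done;

    while ((done = writeback_round(inodes, n, wb_start, chunk)) > 0)
        total += done;
    return total;
}
```

If wb_start were re-sampled inside the loop, an inode that keeps being redirtied would become eligible again in every round, which is the livelock the patch removes.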
    • simplify checks for I_CLEAR/I_FREEING · a4ffdde6
      Committed by Al Viro
      Add I_CLEAR instead of replacing I_FREEING with it.  I_CLEAR is
      equivalent to I_FREEING for almost all code looking at either;
      it's there to keep track of having called clear_inode() exactly
      once per inode lifetime, at some point after having set I_FREEING.
      I_CLEAR and I_FREEING never get set at the same time with the
      current code, so we can switch to setting i_state to I_FREEING | I_CLEAR
      instead of I_CLEAR without loss of information.  As a result of
      this change, checks become simpler and the amount of code that needs
      to know about I_CLEAR shrinks a lot.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
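The simplification can be illustrated with two bit flags. Once clear_inode() sets I_FREEING | I_CLEAR rather than I_CLEAR alone, a single test of I_FREEING covers both states. The bit values and function names below are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bit values; only their relationship matters here. */
#define I_FREEING_SKETCH (1UL << 1)
#define I_CLEAR_SKETCH   (1UL << 2)

/* State written by clear_inode() before and after the change. */
unsigned long cleared_state_old(void) { return I_CLEAR_SKETCH; }
unsigned long cleared_state_new(void) { return I_FREEING_SKETCH | I_CLEAR_SKETCH; }

/* Old style: every "is this inode going away?" check needs both bits. */
bool going_away_old(unsigned long i_state)
{
    return (i_state & (I_FREEING_SKETCH | I_CLEAR_SKETCH)) != 0;
}

/* New style: I_FREEING stays set once I_CLEAR is added, so one bit is enough. */
bool going_away_new(unsigned long i_state)
{
    return (i_state & I_FREEING_SKETCH) != 0;
}
```

The one-bit check is only correct with the new state. Applied to the old I_CLEAR-only state, it would wrongly report that the inode is not going away.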
  9. 08 Aug 2010: 12 commits
    • writeback: optimize periodic bdi thread wakeups · 6467716a
      Committed by Artem Bityutskiy
      When the first inode for a bdi is marked dirty, we wake up the bdi thread which
      should take care of the periodic background write-out. However, the write-out
      will actually start only 'dirty_writeback_interval' centisecs later, so we can
      delay the wake-up.
      
      This change was requested by Nick Piggin who pointed out that if we delay the
      wake-up, we weed out 2 unnecessary context switches, which matters because
      '__mark_inode_dirty()' is a hot-path function.
      
      This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
      sets up a timer to wake-up the bdi thread and returns. So the wake-up is
      delayed.
      
      We also delete the timer in bdi threads just before writing back, and
      synchronously delete it when unregistering the bdi. At the unregister point the bdi
      does not have any users, so no one can arm it again.
      
      Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
      context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
      this change as well.
      
      This patch also moves the 'bdi_wb_init()' function down in the file to avoid
      forward-declaration of 'bdi_wakeup_thread_delayed()'.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
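The delayed wake-up can be modelled with a one-shot timer. It fires dirty_writeback_interval centisecs after the first inode is dirtied, where 1 centisec = 10 ms, so the common default of 500 gives 5 seconds. `struct sketch_timer` and the functions below are hypothetical userspace stand-ins for the kernel timer and for bdi_wakeup_thread_delayed(). The softirq context and the spin_lock_bh() locking are not modelled.

```c
#include <stdbool.h>

/* Minimal stand-in for a kernel timer: an expiry time and an armed flag. */
struct sketch_timer {
    unsigned long expires_ms;
    bool armed;
};

/* Instead of waking the thread now, arm a timer one interval ahead. */
void wakeup_thread_delayed(struct sketch_timer *t, unsigned long now_ms,
                           unsigned int dirty_writeback_interval_cs)
{
    t->expires_ms = now_ms + (unsigned long)dirty_writeback_interval_cs * 10;
    t->armed = true;
}

/* Called as time advances; returns true exactly when the timer fires,
 * which is the point where the bdi thread would be woken. */
bool timer_tick(struct sketch_timer *t, unsigned long now_ms)
{
    if (t->armed && now_ms >= t->expires_ms) {
        t->armed = false;
        return true;
    }
    return false;
}

/* The bdi thread cancels any pending timer just before writing back. */
void del_timer_before_writeback(struct sketch_timer *t)
{
    t->armed = false;
}
```

Because writeback would not start before the interval elapses anyway, deferring the wake-up loses nothing. It also removes the immediate wake-up from the hot path.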
    • writeback: prevent unnecessary bdi threads wakeups · 253c34e9
      Committed by Artem Bityutskiy
      Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
      bad for battery-driven devices.
      
      There are two types of activities bdi threads do:
      1. process bdi works from the 'bdi->work_list'
      2. periodic write-back
      
      So there are 2 sources of wake-up events for bdi threads:
      
      1. 'bdi_queue_work()' - submits bdi works
      2. '__mark_inode_dirty()' - adds dirty I/O to bdi's
      
      The former already has bdi wake-up code. The latter does not, and this patch
      adds it.
      
      '__mark_inode_dirty()' is a hot-path function, but this patch adds another
      'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
      the bdi has no dirty inodes. So adding this spinlock should be fine and should
      not affect performance.
      
      This patch makes sure bdi threads and the forker thread do not wake up if there
      is nothing to do. The forker thread will nevertheless wake up at least every
      5 min. to check whether it has to kill a bdi thread. This can also be optimized,
      but is not worth it.
      
      This patch also tidies up the warning about unregistered bdi, and turns it from
      an ugly crocodile to a simple 'WARN()' statement.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
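The wake-up rule added to __mark_inode_dirty() can be sketched as a transition check: only the first dirty inode on an otherwise clean bdi triggers a wake-up. `struct sketch_bdi` and `mark_inode_dirty_sketch()` are hypothetical. The wb_lock that the real code takes in this rare path is left out.

```c
#include <stdbool.h>

/* Hypothetical per-bdi state: a count of dirty inodes and of wake-ups. */
struct sketch_bdi {
    int nr_dirty_inodes;
    int wakeups;
};

/* Only the transition from "no dirty inodes" to "some dirty inodes"
 * wakes the thread; while inodes are already queued, the thread has
 * already been woken for them, so no further wake-up is sent. */
void mark_inode_dirty_sketch(struct sketch_bdi *bdi)
{
    bool was_clean = (bdi->nr_dirty_inodes == 0);

    bdi->nr_dirty_inodes++;
    if (was_clean)
        bdi->wakeups++;   /* the rare case that would take the lock */
}
```

Dirtying three inodes in a row produces one wake-up. A further wake-up happens only after writeback has emptied the bdi again.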
    • writeback: move bdi threads exiting logic to the forker thread · fff5b85a
      Committed by Artem Bityutskiy
      Currently, bdi threads can decide to exit if there were no useful activities
      for 5 minutes. However, this causes nasty races: we can easily oops in the
      'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up.
      
      And even if we do not oops, if the bdi thread exits immediately after we wake
      it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs)
      in the bdi work processing.
      
      This patch makes the forker thread the central place which not only
      creates bdi threads, but also kills them if they were inactive long enough.
      This is better design-wise.
      
      Another reason why this change was done is to prepare for the further changes
      which will prevent the bdi threads from waking up every 5 sec and wasting
      power. Indeed, when the task does not wake up periodically anymore, it won't be
      able to exit either.
      
      This patch also moves the 'wake_up_bit()' call from the bdi thread to the
      forker thread as well. So now the forker thread sets the BDI_pending bit, then
      forks the task or kills it, then clears the bit and wakes up the waiting
      process.
      
      The only process which may wait on the bit is 'bdi_wb_shutdown()'. This
      function was changed as well - now it first removes the bdi from the
      'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is
      guaranteed that the forker thread won't race with it, because the bdi is not
      visible. Note, the forker thread sets the 'BDI_pending' bit under the
      'bdi->wb_lock' which is essential for proper serialization.
      
      And additionally, when we change 'bdi->wb.task', we now take the
      'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise
      would when raced with, say, 'bdi_queue_work()'.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
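The forker thread's per-bdi decision can be summarised as a pure function. This is a hypothetical sketch. The field names, the use of seconds, and the definition of "inactive" are assumptions drawn from the description above, not the kernel's exact conditions. Here "inactive" means no queued work and no dirty I/O for more than five minutes.

```c
#include <stdbool.h>

/* Hypothetical view of one bdi as seen by the forker thread. */
struct sketch_bdi {
    bool has_thread;            /* a writeback thread currently exists */
    bool has_work;              /* items queued on the work list */
    bool has_dirty_io;          /* dirty inodes waiting for writeback */
    unsigned long last_active;  /* time (seconds) of the last useful work */
};

enum forker_action { FORKER_NOTHING, FORKER_CREATE, FORKER_KILL };

#define SKETCH_MAX_IDLE_SECS (5 * 60)   /* the "5 minutes" from the text */

/* Decide what the forker thread should do for this bdi right now. */
enum forker_action forker_decide(const struct sketch_bdi *bdi,
                                 unsigned long now)
{
    bool busy = bdi->has_work || bdi->has_dirty_io;

    if (!bdi->has_thread)
        return busy ? FORKER_CREATE : FORKER_NOTHING;

    if (!busy && now - bdi->last_active > SKETCH_MAX_IDLE_SECS)
        return FORKER_KILL;

    return FORKER_NOTHING;
}
```

Only the forker thread makes this decision. A bdi thread therefore never exits on its own while someone is queueing work for it, which is the race described above.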
    • writeback: move last_active to bdi · ecd58403
      Committed by Artem Bityutskiy
      Currently bdi threads use a local variable 'last_active', which stores the last
      time the bdi thread did some useful work. Move this local variable to 'struct
      bdi_writeback'. This is just a preparation for the further patches which will
      make the forker thread decide when bdi threads should be killed.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: do not remove bdi from bdi_list · 78c40cb6
      Committed by Artem Bityutskiy
      The forker thread removes bdis from 'bdi_list' before forking the bdi thread.
      But this is wrong for at least 2 reasons.
      
      Reason #1: if we temporarily remove a bdi from the list, we may miss works which
                 would otherwise be given to us.
      
      Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are
                 always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when
                 it races with the forker thread, it can shut down the bdi thread
                 at the same time as the forker creates it.
      
      This patch makes sure the forker thread never removes bdis from 'bdi_list'
      (which was suggested by Christoph Hellwig).
      
      In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to
      hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending'
      flag.
      
      NOTE! The error path is interesting. Currently, when we fail to create a bdi
      thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the
      bdi from the list, we cannot move it to the tail either, because then we can
      mess up the RCU readers which walk the list. And also, we'll have the race
      described above in "Reason #2".
      
      But I do not think that adding to the tail is important, so I just do not do
      that.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
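Walking the list and marking a bdi pending under one lock, without unlinking it, might look like this. The sketch substitutes a pthread mutex for the kernel's bdi_lock and a plain bool for the BDI_pending bit. It ignores RCU, and `forker_pick()` is a hypothetical name.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_bdi {
    struct sketch_bdi *next;
    bool needs_thread;
    bool pending;       /* stand-in for the BDI_pending bit */
};

/* Stand-in for bdi_lock. */
pthread_mutex_t bdi_lock_sketch = PTHREAD_MUTEX_INITIALIZER;

/* Find a bdi that needs a thread and mark it pending while holding the
 * list lock. A concurrent shutdown therefore sees either "not pending"
 * (and may proceed) or "pending" (and must wait). The bdi is never
 * removed from the list, so work can still be queued to it. */
struct sketch_bdi *forker_pick(struct sketch_bdi *head)
{
    struct sketch_bdi *found = NULL;

    pthread_mutex_lock(&bdi_lock_sketch);
    for (struct sketch_bdi *b = head; b != NULL; b = b->next) {
        if (b->needs_thread && !b->pending) {
            b->pending = true;
            found = b;
            break;
        }
    }
    pthread_mutex_unlock(&bdi_lock_sketch);
    return found;
}
```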
    • writeback: do not lose wake-ups in bdi threads · 297252c8
      Committed by Artem Bityutskiy
      Currently, bdi threads ('bdi_writeback_thread()') can lose wake-ups, for
      example if 'bdi_queue_work()' is executed after the bdi thread has
      finished 'wb_do_writeback()' but before it has called
      'schedule_timeout_interruptible()'.
      
      To fix this issue, we have to check whether we have works to process after we
      have changed the task state to 'TASK_INTERRUPTIBLE'.
      
      This patch also cleans up the handling of the cases when
      'dirty_writeback_interval' is zero or non-zero.
      
      Additionally, this patch removes an unneeded 'list_empty_careful()' call.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
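After setting TASK_INTERRUPTIBLE, the thread decides whether to sleep and for how long. The hypothetical helper below captures the two points above. It checks for queued work after the state change, and it handles a zero versus non-zero dirty_writeback_interval. The enum and function are illustrative, and the conversion assumes the interval is in centisecs.

```c
#include <stdbool.h>

/* What the bdi thread does after setting itself TASK_INTERRUPTIBLE. */
enum sleep_kind { SLEEP_NONE, SLEEP_TIMEOUT, SLEEP_FOREVER };

/* Work queued between the end of writeback and this check is seen here
 * (SLEEP_NONE) because the check happens after the state change. */
enum sleep_kind bdi_thread_sleep_kind(bool work_pending,
                                      unsigned int dirty_writeback_interval,
                                      unsigned long *timeout_ms)
{
    if (work_pending)
        return SLEEP_NONE;   /* go straight back round the loop */

    if (dirty_writeback_interval != 0) {
        /* centisecs -> milliseconds */
        *timeout_ms = (unsigned long)dirty_writeback_interval * 10;
        return SLEEP_TIMEOUT;
    }

    /* An interval of zero disables periodic writeback: sleep until woken. */
    return SLEEP_FOREVER;
}
```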
    • writeback: harmonize writeback threads naming · 6f904ff0
      Committed by Artem Bityutskiy
      The write-back code mixes words "thread" and "task" for the same things. This
      is not a big deal, but still an inconsistency.
      
      hch: a convention I tend to use and I've seen in various places
      is to always use _task for the storage of the task_struct pointer,
      and thread everywhere else.  This especially helps with having
      foo_thread for the actual thread and foo_task for a global
      variable keeping the task_struct pointer
      
      This patch renames:
      * 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
      * 'bdi_forker_task()'              -> 'bdi_forker_thread()'
      
      because bdi threads are 'bdi_writeback_thread()', so these names are more
      consistent.
      
      This patch also amends comments and makes them refer to the forker and bdi
      threads as "thread", not "task".
      
      Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use
      'static void' instead of 'void static' and make checkpatch.pl happy.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: remove wb in get_next_work_item · 08852b6d
      Committed by Minchan Kim
      Commit 83ba7b07 cleaned up the writeback code, so
      get_next_work_item no longer uses wb.
      Let's remove the unnecessary argument.
      
      CC: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: Add tracing to balance_dirty_pages · 028c2dd1
      Committed by Dave Chinner
      Tracing high level background writeback events is good, but it doesn't
      give the entire picture. Add visibility into write throttling to catch IO
      dispatched by foreground throttling of processes dirtying lots of pages.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: Initial tracing support · 455b2864
      Committed by Dave Chinner
      Trace queue/sched/exec parts of the writeback loop. This provides
      insight into when and why flusher threads are scheduled to run. For
      example, a sync invocation leaves traces like:
      
           sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
      flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
      
      This also lays the foundation for adding more writeback tracing to
      provide deeper insight into the whole writeback path.
      
      The original tracing code is from Jens Axboe, though this version is
      a rewrite as a result of the code being traced changing
      significantly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: merge bdi_writeback_task and bdi_start_fn · 08243900
      Committed by Christoph Hellwig
      Move all code for the writeback thread into fs/fs-writeback.c instead of
      splitting it over two functions in two files.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: remove wb_list · c1955ce3
      Committed by Christoph Hellwig
      The wb_list member of struct backing_dev_info always has exactly one
      element.  Just use the direct bdi->wb pointer instead and simplify some
      code.
      
      Also remove bdi_task_init which is now trivial to prepare for the next
      patch.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  10. 06 Jul 2010: 3 commits
    • writeback: simplify the write back thread queue · 83ba7b07
      Committed by Christoph Hellwig
      First, remove items from work_list as soon as we start working on them.  This
      means we don't have to track any pending or visited state and can get
      rid of all the RCU magic freeing the work items - we can simply free
      them once the operation has finished.  Second, use a real completion for
      tracking synchronous requests - if the caller sets the completion pointer
      we complete it, otherwise use it as a boolean indicator that we can free
      the work item directly.  Third, unify struct wb_writeback_args and struct
      bdi_work into a single data structure, wb_writeback_work.  Previously we
      set all parameters into a struct wb_writeback_args, copied it into
      struct bdi_work, and copied it again on the stack to use it there.  Instead,
      just allocate one structure dynamically or on the stack and use it
      all the way through the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
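A userspace sketch of the simplified queue follows. Each item is unlinked as soon as processing starts, and the optional completion pointer decides who owns the item afterwards. The types and functions are hypothetical stand-ins loosely modelled on the wb_writeback_work description. No locking is shown.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a completion. */
struct sketch_completion { bool done; };

/* One structure carries the arguments all the way through. */
struct sketch_work {
    long nr_pages;
    bool sync_mode;
    struct sketch_completion *done;   /* NULL: asynchronous, free after use */
    struct sketch_work *next;
};

/* Pop the first item: it leaves the list as soon as we start on it,
 * so no "pending" or "visited" state is needed. */
struct sketch_work *get_next_work(struct sketch_work **list)
{
    struct sketch_work *w = *list;

    if (w)
        *list = w->next;
    return w;
}

/* A waiting caller owns the memory (often on its stack), so complete it;
 * otherwise nobody else references the item and we free it. */
void finish_work(struct sketch_work *w)
{
    if (w->done)
        w->done->done = true;
    else
        free(w);
}

long process_all(struct sketch_work **list)
{
    long pages = 0;
    struct sketch_work *w;

    while ((w = get_next_work(list)) != NULL) {
        pages += w->nr_pages;   /* stand-in for the actual writeback */
        finish_work(w);         /* w must not be touched after this */
    }
    return pages;
}
```

A synchronous caller can keep the item on its stack and wait on the completion. Fire-and-forget callers allocate the item and leave `done` NULL.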
    • writeback: split writeback_inodes_wb · edadfb10
      Committed by Christoph Hellwig
      The case where we have a superblock doesn't require a loop here as we scan
      over all inodes in writeback_sb_inodes. Split it out into a separate helper
      to make the code simpler.  This also allows us to get rid of the sb member in
      struct writeback_control, which was rather out of place there.
      
      Also update the comments in writeback_sb_inodes that explain the handling
      of inodes from wrong superblocks.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • writeback: remove writeback_inodes_wbc · 9c3a8ee8
      Committed by Christoph Hellwig
      This was just an odd wrapper around writeback_inodes_wb.  Removing this
      also allows us to get rid of the bdi member of struct writeback_control
      which was rather out of place there.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  11. 01 Jul 2010: 1 commit
  12. 15 Jun 2010: 1 commit
  13. 11 Jun 2010: 3 commits