1. 08 Jun 2011, 5 commits
    • writeback: the kupdate expire timestamp should be a moving target · ba9aa839
      Committed by Wu Fengguang
      Dynamically compute the dirty expire timestamp at queue_io() time.
      
      writeback_control.older_than_this used to be determined at entrance to
      the kupdate writeback work. This _static_ timestamp may go stale if the
      kupdate work runs on and on. The flusher may then get stuck with some old
      busy inodes, never considering newly expired inodes thereafter.
      
      This has two possible problems:
      
      - It is unfair for a large dirty inode to delay (for a long time) the
        writeback of small dirty inodes.
      
      - As time goes by, the large and busy dirty inode may contain only
        _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
        delaying the expired dirty pages to the end of LRU lists, triggering
        the evil pageout(). Nevertheless this patch merely addresses part
        of the problem.
      
      v2: keep policy changes inside wb_writeback() and keep the
      wbc.older_than_this visibility as suggested by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      ba9aa839
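The fix above boils down to recomputing the expiry cutoff every time inodes are queued, instead of freezing it when the kupdate work starts. A minimal sketch (illustrative C, not the kernel code; the interval constant is made up):

```c
#include <stdbool.h>

/* Hypothetical stand-in for dirty_expire_interval, in seconds. */
#define TOY_DIRTY_EXPIRE_INTERVAL 30

/* The cutoff is recomputed at each queue_io(): a moving target. */
static long moving_expire_cutoff(long now)
{
    return now - TOY_DIRTY_EXPIRE_INTERVAL;
}

/* An inode is expired (eligible for kupdate writeback) once its
 * dirtied_when falls behind the current cutoff. With a static cutoff
 * taken at work start, an inode dirtied just before that start would
 * never become eligible while the work runs on and on. */
bool inode_expired(long dirtied_when, long now)
{
    return dirtied_when <= moving_expire_cutoff(now);
}
```

With a cutoff frozen at now=100, an inode dirtied at 90 stays ineligible forever; with the moving cutoff it becomes eligible as soon as now reaches 120.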
    • writeback: try more writeback as long as something was written · e6fb6da2
      Committed by Wu Fengguang
      writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
      they only populate possibly a subset of eligible inodes into b_io at
      entrance time. When the queued set of inodes is all synced, they just
      return, possibly with all queued inode pages written but still
      wbc.nr_to_write > 0.
      
      For kupdate and background writeback, there may be more eligible inodes
      sitting in b_dirty when the current set of b_io inodes are completed. So
      it is necessary to try another round of writeback as long as we made some
      progress in this round. When there are no more eligible inodes, no more
      inodes will be enqueued in queue_io(), hence nothing could/will be
      synced and we may safely bail.
      
      For example, imagine 100 inodes
      
              i0, i1, i2, ..., i90, i91, ..., i99
      
      At queue_io() time, i90-i99 happen to be expired and moved to b_io for
      IO. When finished successfully, if their total size is less than
      MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
      quit the background work (w/o this patch) while it's still over
      background threshold. This will be a fairly normal/frequent case I guess.
      
      Now that we do tagged sync and update inode->dirtied_when after the sync,
      this change won't livelock sync(1).  I actually tried to write 1 page
      per 1ms with this command
      
      	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
      
      and do sync(1) at the same time. The sync completes quickly on ext4,
      xfs, btrfs.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      e6fb6da2
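The retry logic described above can be modeled as a loop that keeps doing rounds of writeback while the previous round wrote anything. This is an illustrative sketch, not the kernel code; inodes are reduced to plain page counts:

```c
#include <stddef.h>

/* One round: write as many pages as possible from the queued set,
 * bounded by the remaining budget. Returns pages written this round. */
static long writeback_round(long *pages, size_t n, long budget)
{
    long wrote = 0;
    for (size_t i = 0; i < n && budget > 0; i++) {
        long w = pages[i] < budget ? pages[i] : budget;
        pages[i] -= w;
        budget   -= w;
        wrote    += w;
    }
    return wrote;
}

/* Keep trying another round as long as the last one made progress;
 * when nothing was written, nothing more is eligible and we bail. */
long writeback_until_no_progress(long *pages, size_t n, long budget_per_round)
{
    long total = 0, wrote;
    do {
        wrote = writeback_round(pages, n, budget_per_round);
        total += wrote;
    } while (wrote > 0);
    return total;
}
```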
    • writeback: introduce writeback_control.inodes_written · cb9bd115
      Committed by Wu Fengguang
      The flusher works on dirty inodes in batches, and may quit prematurely
      if the batch of inodes happens to be metadata-only dirtied: in this case
      wbc->nr_to_write won't be decreased at all, which stands for "no pages
      written" but is also misinterpreted as "no progress".
      
      So introduce writeback_control.inodes_written to count the inodes that get
      cleaned from the VFS point of view.  A non-zero value means there is some
      progress in writeback, in which case more writeback can be tried.
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      cb9bd115
    • writeback: update dirtied_when for synced inode to prevent livelock · 94c3dcbb
      Committed by Wu Fengguang
      Explicitly update .dirtied_when on synced inodes, so that they are no
      longer considered for writeback in the next round.
      
      It can prevent both of the following livelock schemes:
      
      - while true; do echo data >> f; done
      - while true; do touch f;        done (in theory)
      
      The exact livelock condition is, during sync(1):
      
      (1) no new inodes are dirtied
      (2) one existing inode is being actively dirtied
      
      On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
      When finished, it will be redirty_tail()ed because it's still dirty
      and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
      on condition (1). The sync work will then revisit it on the next
      queue_io() and find it eligible again because its old ->dirtied_when
      predates the sync work start time.
      
      We'll do more aggressive "keep writeback as long as we wrote something"
      logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
      b9543dac ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
      no longer be enough to stop sync livelock.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      94c3dcbb
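The livelock fix can be illustrated with a toy model (hypothetical names, not the kernel code): the sync work only picks inodes whose dirtied_when predates the work's start, and refreshing dirtied_when after syncing an actively-dirtied inode takes it out of that set:

```c
#include <stdbool.h>

struct toy_inode { long dirtied_when; bool dirty; };

/* Sync work only picks inodes dirtied before the work started. */
bool eligible_for_sync(const struct toy_inode *i, long work_start)
{
    return i->dirty && i->dirtied_when < work_start;
}

/* After writing an inode that was redirtied while we synced it, refresh
 * dirtied_when so the same sync pass will not pick it up again.
 * Returns the inode's new dirtied_when (the livelock fix in a nutshell). */
long finish_sync_inode(struct toy_inode *i, long now)
{
    if (i->dirty)
        i->dirtied_when = now;   /* no longer predates the sync start */
    return i->dirtied_when;
}
```

Without the refresh, the still-dirty inode keeps its old dirtied_when, stays eligible, and the sync work revisits it forever.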
    • writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Committed by Wu Fengguang
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      that may be due to some "redirty_tail() after pages_skipped" effect,
      which is by no means a guarantee for _all_ the file systems.
      
      Note that writeback_inodes_sb() is called not only by sync(); all its
      callers are treated the same because the other callers also need
      livelock prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
      Acked-by: Jan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      6e6938b6
  2. 27 May 2011, 1 commit
    • fs: pass exact type of data dirties to ->dirty_inode · aa385729
      Committed by Christoph Hellwig
      Tell the filesystem if we just updated the timestamp (I_DIRTY_SYNC) or
      anything else, so that the filesystem can track internally if it
      needs to push out a transaction for fdatasync or not.
      
      This is just the prototype change with no user for it yet.  I plan
      to push large XFS changes for the next merge window, and getting
      this trivial infrastructure in this window would help a lot to avoid
      tree interdependencies.
      
      Also remove incorrect comments that ->dirty_inode can't block.  That
      has been changed a long time ago, and many implementations rely on it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      aa385729
  3. 31 Mar 2011, 1 commit
  4. 25 Mar 2011, 4 commits
    • fs: pull inode->i_lock up out of writeback_single_inode · 0f1b1fd8
      Committed by Dave Chinner
      First thing we do in writeback_single_inode() is take the i_lock and
      the last thing we do is drop it. A caller already holds the i_lock,
      so pull the i_lock out of writeback_single_inode() to reduce the
      round trips on this lock during inode writeback.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0f1b1fd8
    • fs: move i_wb_list out from under inode_lock · a66979ab
      Committed by Dave Chinner
      Protect the inode writeback list with a new global lock
      inode_wb_list_lock and use it to protect the list manipulations and
      traversals. This lock replaces the inode_lock as the inodes on the
      list can be validity checked while holding the inode->i_lock and
      hence the inode_lock is no longer needed to protect the list.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a66979ab
    • fs: move i_sb_list out from under inode_lock · 55fa6091
      Committed by Dave Chinner
      Protect the per-sb inode list with a new global lock
      inode_sb_list_lock and use it to protect the list manipulations and
      traversals. This lock replaces the inode_lock as the inodes on the
      list can be validity checked while holding the inode->i_lock and
      hence the inode_lock is no longer needed to protect the list.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      55fa6091
    • fs: protect inode->i_state with inode->i_lock · 250df6ed
      Committed by Dave Chinner
      Protect inode state transitions and validity checks with the
      inode->i_lock. This enables us to make inode state transitions
      independently of the inode_lock and is the first step to peeling
      away the inode_lock from the code.
      
      This requires that __iget() is done atomically with i_state checks
      during list traversals so that we don't race with another thread
      marking the inode I_FREEING between the state check and grabbing the
      reference.
      
      Also remove the unlock_new_inode() memory barrier optimisation
      required to avoid taking the inode_lock when clearing I_NEW.
      Simplify the code by simply taking the inode->i_lock around the
      state change and wakeup. Because the wakeup is no longer tricky,
      remove the wake_up_inode() function and open code the wakeup where
      necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      250df6ed
  5. 14 Jan 2011, 6 commits
    • fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-doc · cb9ef8d5
      Committed by Stefan Hajnoczi
      The sync_inodes_sb() function does not have a return value.  Remove the
      outdated documentation comment.
      Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb9ef8d5
    • sync_inode_metadata: fix comment · c691b9d9
      Committed by Andrew Morton
      Use correct function name, remove incorrect apostrophe
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c691b9d9
    • writeback: avoid livelocking WB_SYNC_ALL writeback · b9543dac
      Committed by Jan Kara
      When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
      usually set to LONG_MAX.  The logic in wb_writeback() then calls
      __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
      easily end up with non-positive nr_to_write after the function returns, if
      the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.
      
      When nr_to_write is <= 0 wb_writeback() decides we need another round of
      writeback but this is wrong in some cases!  For example when a single
      large file is continuously dirtied, we would never finish syncing it
      because each pass would be able to write MAX_WRITEBACK_PAGES and the
      inode dirty timestamp never gets updated (as the inode is never
      completely clean).
      Thus __writeback_inodes_sb() would write the redirtied inode again and
      again.
      
      Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode.  We
      do not need nr_to_write in WB_SYNC_ALL mode anyway since
      write_cache_pages() does livelock avoidance using page tagging in
      WB_SYNC_ALL mode.
      
      This makes wb_writeback() call __writeback_inodes_sb() only once on
      WB_SYNC_ALL.  The latter function won't livelock because it works on
      
      - a finite set of files by doing queue_io() once at the beginning
      - a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging
      
      After this patch, the program from http://lkml.org/lkml/2010/10/24/154 is
      no longer able to stall sync forever.
      
      [fengguang.wu@intel.com: fix locking comment]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9543dac
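In rough terms (illustrative constants, not the kernel code), the fix picks the per-call budget like this: WB_SYNC_ALL gets LONG_MAX because page tagging already bounds the work, while WB_SYNC_NONE keeps the chunked budget:

```c
#include <limits.h>

enum sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

/* Toy stand-in for MAX_WRITEBACK_PAGES. */
#define TOY_MAX_WRITEBACK_PAGES 1024

/* Budget per __writeback_inodes_sb() call: data-integrity sync relies
 * on PAGECACHE_TAG_TOWRITE tagging for livelock avoidance, so it gets
 * an unlimited budget and runs in a single pass; WB_SYNC_NONE keeps
 * the chunked budget so one huge inode cannot starve the others. */
long writeback_budget(enum sync_mode mode)
{
    return mode == WB_SYNC_ALL ? LONG_MAX : TOY_MAX_WRITEBACK_PAGES;
}
```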
    • writeback: stop background/kupdate works from livelocking other works · aa373cf5
      Committed by Jan Kara
      Background writeback is easily livelockable in a loop in wb_writeback() by
      a process continuously re-dirtying pages (or continuously appending to a
      file).  This is in fact intended, as the goal of background writeback is
      to write whatever dirty pages it can find as long as we are over
      dirty_background_threshold.
      
      But the above behavior gets inconvenient at times because no other work
      queued in the flusher thread's queue gets processed.  In particular, since
      e.g.  sync(1) relies on flusher thread to do all the IO for it, sync(1)
      can hang forever waiting for flusher thread to do the work.
      
      Generally, when a flusher thread has some work queued, someone submitted
      the work to achieve a goal more specific than what background writeback
      does.  Moreover, by working on the specific work, we also reduce the
      amount of dirty pages, which is exactly the target of background
      writeout.  So it
      makes sense to give specific work a priority over a generic page cleaning.
      
      Thus we interrupt background writeback if there is some other work to do.
      We return to the background writeback after completing all the queued
      work.
      
      This may delay the writeback of expired inodes for a while; however, the
      expired inodes will eventually be flushed to disk as long as the other
      works do not livelock.
      
      [fengguang.wu@intel.com: update comment]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa373cf5
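The decision rule above, reduced to a predicate (an illustrative sketch, not the kernel code): background writeback continues only while we are over the background threshold and no specific work is queued:

```c
#include <stdbool.h>

/* Background writeback keeps going while over the threshold, but yields
 * as soon as specific work (e.g. a sync request) is queued; the queued
 * work runs first and background writeback resumes afterwards. */
bool background_should_continue(long nr_dirty, long background_thresh,
                                bool other_work_queued)
{
    if (other_work_queued)
        return false;              /* yield to the specific work */
    return nr_dirty > background_thresh;
}
```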
    • writeback: trace wakeup event for background writeback · 71927e84
      Committed by Wu Fengguang
      This tracks when balance_dirty_pages() tries to wakeup the flusher thread
      for background writeback (if it was not started already).
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71927e84
    • writeback: integrated background writeback work · 6585027a
      Committed by Jan Kara
      Check whether background writeback is needed after finishing each work.
      
      When the bdi flusher thread finishes doing some work, check whether any kind of
      background writeback needs to be done (either because
      dirty_background_ratio is exceeded or because we need to start flushing
      old inodes).  If so, just do background write back.
      
      This way, bdi_start_background_writeback() just needs to wake up the
      flusher thread.  It will do background writeback as soon as there is no
      other work.
      
      This is a preparatory patch for the next patch which stops background
      writeback as soon as there is other work to do.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6585027a
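A sketch of the check performed after the work list drains (all names here are illustrative, not the kernel's):

```c
#include <stdbool.h>

/* After the flusher drains its work list, fall back to background
 * writeback if it is needed; bdi_start_background_writeback() then
 * only has to wake the thread. */
static bool over_background_thresh(long nr_dirty, long background_thresh)
{
    return nr_dirty > background_thresh;
}

bool should_do_background(bool work_list_empty, long nr_dirty,
                          long background_thresh, bool old_data_exists)
{
    if (!work_list_empty)
        return false;              /* queued work always goes first */
    return over_background_thresh(nr_dirty, background_thresh) ||
           old_data_exists;        /* kupdate-style flush of old inodes */
}
```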
  6. 30 Oct 2010, 1 commit
    • fs-writeback.c: unify some common code · cdf01dd5
      Committed by Linus Torvalds
      The btrfs merge looks like hell, because it changes fs-writeback.c, and
      the crazy code has this repeated "estimate number of dirty pages"
      counting that involves three different helper functions.  And it's done
      in two different places.
      
      Just unify that whole calculation as a "get_nr_dirty_pages()" helper
      function, and the merge result will look half-way decent.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdf01dd5
  7. 29 Oct 2010, 1 commit
    • Add new functions for triggering inode writeback · 3259f8be
      Committed by Chris Mason
      When btrfs is running low on metadata space, it needs to force delayed
      allocation pages to disk.  It currently does this with a suboptimal walk
      of a private list of inodes with delayed allocation, and it would be
      much better if we used the generic flusher threads.
      
      writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
      thread to start IO on all the dirty pages in the FS before it returns.
      This adds variants of writeback_inodes_sb* that allow the caller to
      control how many pages get sent down.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      3259f8be
  8. 27 Oct 2010, 3 commits
  9. 26 Oct 2010, 6 commits
  10. 04 Oct 2010, 1 commit
    • writeback: always use sb->s_bdi for writeback purposes · aaead25b
      Committed by Christoph Hellwig
      We currently use struct backing_dev_info for various different purposes.
      Originally it was introduced to describe a backing device which includes
      an unplug and congestion function and various bits of readahead information
      and VM-relevant flags.  We're also using it for tracking dirty inodes for
      writeback.
      
      To make writeback properly find all inodes we need to only access the
      per-filesystem backing_device pointed to by the superblock in ->s_bdi
      inside the writeback code, and not the instances pointed to by
      inode->i_mapping->backing_dev_info, which can be overridden by special
      devices or might not be set at all by some filesystems.
      
      Long term we should split out the writeback-relevant bits of struct
      backing_device_info (which includes more than the current bdi_writeback)
      and only point to it from the superblock while leaving the traditional
      backing device as a separate structure that can be overridden by devices.
      
      The one exception for now is the block device filesystem, which really
      wants different writeback contexts for its different (internal) inodes
      to handle the writeout more efficiently.  For now we do this with
      a hack in fs-writeback.c because we're so late in the cycle, but in
      the future I plan to replace this with a superblock method that allows
      for multiple writeback contexts per filesystem.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      aaead25b
  11. 22 Sep 2010, 1 commit
    • bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friends · 692ebd17
      Committed by Jan Kara
      Inodes of devices such as /dev/zero can get dirty for example via
      utime(2) syscall or due to atime update. Backing device of such inodes
      (zero_bdi, etc.) is however unable to handle dirty inodes and thus
      __mark_inode_dirty complains.  In fact, inode should be rather dirtied
      against backing device of the filesystem holding it. This is generally a
      good rule except for filesystems such as 'bdev' or 'mtd_inodefs'. Inodes
      in these pseudofilesystems are referenced from ordinary filesystem
      inodes and carry mapping with real data of the device. Thus for these
      inodes we have to use inode->i_mapping->backing_dev_info as we did so
      far. We distinguish these filesystems by checking whether sb->s_bdi
      points to a non-trivial backing device or not.
      
      Example: Assume we have an ext3 filesystem on /dev/sda1 mounted on /.
      There's a device inode A described by a path "/dev/sdb" on this
      filesystem. This inode will be dirtied against backing device "8:0"
      after this patch. bdev filesystem contains block device inode B coupled
      with our inode A. When someone modifies a page of /dev/sdb, it's B that
      gets dirtied and the dirtying happens against the backing device "8:16".
      Thus both inodes get filed to a correct bdi list.
      
      Cc: stable@kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      692ebd17
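The dirtying rule can be sketched as follows. Note the real check tests whether sb->s_bdi points to a non-trivial backing device; this toy version distinguishes the pseudo-filesystems by name for simplicity, and all types and names here are made up:

```c
#include <string.h>

struct toy_bdi { const char *name; };

/* Pick the backing device an inode is dirtied against: inodes of
 * pseudo-filesystems like bdev or mtd_inodefs carry the device's real
 * data, so they keep their own mapping's bdi; everything else, device
 * nodes included, is dirtied against the holding filesystem's sb->s_bdi. */
const struct toy_bdi *inode_to_bdi(const char *fs_type,
                                   const struct toy_bdi *sb_bdi,
                                   const struct toy_bdi *mapping_bdi)
{
    if (strcmp(fs_type, "bdev") == 0 || strcmp(fs_type, "mtd_inodefs") == 0)
        return mapping_bdi;   /* inode carries real device data */
    return sb_bdi;            /* dirty against the holding filesystem */
}
```

In the ext3 example above, the /dev/sdb inode on the ext3 filesystem files under "8:0", while the coupled bdev inode files under "8:16".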
  12. 28 Aug 2010, 1 commit
    • writeback: Fix lost wake-up shutting down writeback thread · b76b4014
      Committed by J. Bruce Fields
      Setting the task state here may cause us to miss the wake up from
      kthread_stop(), so we need to recheck kthread_should_stop() or risk
      sleeping forever in the following schedule().
      
      Symptom was an indefinite hang on an NFSv4 mount.  (NFSv4 may create
      multiple mounts in a temporary namespace while traversing the mount
      path, and since the temporary namespace is immediately destroyed, it may
      end up destroying a mount very soon after it was created, possibly
      making this race more likely.)
      
      INFO: task mount.nfs4:4314 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.nfs4    D 0000000000000000  2880  4314   4313 0x00000000
       ffff88001ed6da28 0000000000000046 ffff88001ed6dfd8 ffff88001ed6dfd8
       ffff88001ed6c000 ffff88001ed6c000 ffff88001ed6c000 ffff88001e5003a0
       ffff88001ed6dfd8 ffff88001e5003a8 ffff88001ed6c000 ffff88001ed6dfd8
      Call Trace:
       [<ffffffff8196090d>] schedule_timeout+0x1cd/0x2e0
       [<ffffffff8106a31c>] ? mark_held_locks+0x6c/0xa0
       [<ffffffff819639a0>] ? _raw_spin_unlock_irq+0x30/0x60
       [<ffffffff8106a5fd>] ? trace_hardirqs_on_caller+0x14d/0x190
       [<ffffffff819671fe>] ? sub_preempt_count+0xe/0xd0
       [<ffffffff8195fc80>] wait_for_common+0x120/0x190
       [<ffffffff81033c70>] ? default_wake_function+0x0/0x20
       [<ffffffff8195fdcd>] wait_for_completion+0x1d/0x20
       [<ffffffff810595fa>] kthread_stop+0x4a/0x150
       [<ffffffff81061a60>] ? thaw_process+0x70/0x80
       [<ffffffff810cc68a>] bdi_unregister+0x10a/0x1a0
       [<ffffffff81229dc9>] nfs_put_super+0x19/0x20
       [<ffffffff810ee8c4>] generic_shutdown_super+0x54/0xe0
       [<ffffffff810ee9b6>] kill_anon_super+0x16/0x60
       [<ffffffff8122d3b9>] nfs4_kill_super+0x39/0x90
       [<ffffffff810eda45>] deactivate_locked_super+0x45/0x60
       [<ffffffff810edfb9>] deactivate_super+0x49/0x70
       [<ffffffff81108294>] mntput_no_expire+0x84/0xe0
       [<ffffffff811084ef>] release_mounts+0x9f/0xc0
       [<ffffffff81108575>] put_mnt_ns+0x65/0x80
       [<ffffffff8122cc56>] nfs_follow_remote_path+0x1e6/0x420
       [<ffffffff8122cfbf>] nfs4_try_mount+0x6f/0xd0
       [<ffffffff8122d0c2>] nfs4_get_sb+0xa2/0x360
       [<ffffffff810edcb8>] vfs_kern_mount+0x88/0x1f0
       [<ffffffff810ede92>] do_kern_mount+0x52/0x130
       [<ffffffff81963d9a>] ? _lock_kernel+0x6a/0x170
       [<ffffffff81108e9e>] do_mount+0x26e/0x7f0
       [<ffffffff81106b3a>] ? copy_mount_options+0xea/0x190
       [<ffffffff811094b8>] sys_mount+0x98/0xf0
       [<ffffffff810024d8>] system_call_fastpath+0x16/0x1b
      1 lock held by mount.nfs4/4314:
       #0:  (&type->s_umount_key#24){+.+...}, at: [<ffffffff810edfb1>] deactivate_super+0x41/0x70
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      b76b4014
  13. 12 Aug 2010, 5 commits
  14. 10 Aug 2010, 2 commits
    • mm: avoid resetting wb_start after each writeback round · 7624ee72
      Committed by Jan Kara
      WB_SYNC_NONE writeback is done in rounds of 1024 pages so that we don't
      write out some huge inode for too long while starving writeout of other
      inodes.  To avoid livelocks, we record time we started writeback in
      wbc->wb_start and do not write out inodes which were dirtied after this
      time.  But currently, writeback_inodes_wb() resets wb_start each time it
      is called thus effectively invalidating this logic and making any
      WB_SYNC_NONE writeback prone to livelocks.
      
      This patch makes sure wb_start is set only once when we start writeback.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7624ee72
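The fix is essentially "set wb_start once, then compare against it". A toy model (illustrative, not the kernel code):

```c
/* wb_start records when this writeback began; inodes dirtied after it
 * are skipped to avoid livelock. Set it once per writeback rather than
 * on each writeback_inodes_wb() call so the cutoff does not slide
 * forward and re-admit freshly dirtied inodes. 0 means "not set yet". */
long set_wb_start_once(long current_wb_start, long now)
{
    return current_wb_start != 0 ? current_wb_start : now;
}

int should_skip_inode(long wb_start, long dirtied_when)
{
    return dirtied_when > wb_start;   /* dirtied after writeback started */
}
```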
    • simplify checks for I_CLEAR/I_FREEING · a4ffdde6
      Committed by Al Viro
      add I_CLEAR instead of replacing I_FREEING with it.  I_CLEAR is
      equivalent to I_FREEING for almost all code looking at either;
      it's there to keep track of having called clear_inode() exactly
      once per inode lifetime, at some point after having set I_FREEING.
      I_CLEAR and I_FREEING never get set at the same time with the
      current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
      instead of I_CLEAR without loss of information.  As the result of
      such change, checks become simpler and the amount of code that needs
      to know about I_CLEAR shrinks a lot.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a4ffdde6
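The simplification can be sketched with toy flag values (illustrative, not the kernel's definitions): once I_FREEING is always set alongside I_CLEAR, a single I_FREEING test answers "is this inode going away?":

```c
/* Toy flag values; the kernel's real I_* constants differ. */
#define I_FREEING (1u << 0)
#define I_CLEAR   (1u << 1)

/* New scheme: set i_state to I_FREEING | I_CLEAR instead of replacing
 * I_FREEING with I_CLEAR, so no information is lost. */
unsigned int mark_inode_clearing(unsigned int i_state)
{
    return i_state | I_FREEING | I_CLEAR;
}

/* Checks shrink: since I_CLEAR now implies I_FREEING, testing I_FREEING
 * alone covers both states. */
int inode_going_away(unsigned int i_state)
{
    return (i_state & I_FREEING) != 0;
}
```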
  15. 08 Aug 2010, 2 commits
    • writeback: optimize periodic bdi thread wakeups · 6467716a
      Committed by Artem Bityutskiy
      When the first inode for a bdi is marked dirty, we wake up the bdi thread
      which should take care of the periodic background write-out. However, the
      write-out will actually start only 'dirty_writeback_interval' centisecs
      later, so we can delay the wake-up.
      
      This change was requested by Nick Piggin, who pointed out that if we delay
      the wake-up, we avoid 2 unnecessary context switches, which matters because
      '__mark_inode_dirty()' is a hot-path function.
      
      This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
      sets up a timer to wake-up the bdi thread and returns. So the wake-up is
      delayed.
      
      We also delete the timer in bdi threads just before writing-back. And
      synchronously delete it when unregistering bdi. At the unregister point the bdi
      does not have any users, so no one can arm it again.
      
      Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
      context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
      this change as well.
      
      This patch also moves the 'bdi_wb_init()' function down in the file to avoid
      forward-declaration of 'bdi_wakeup_thread_delayed()'.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      6467716a
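The delayed wake-up can be modeled as arming a one-shot timer only if none is pending (a simplified sketch with made-up names; the kernel uses a real timer protected by bdi->wb_lock):

```c
/* Instead of waking the bdi thread immediately when the first inode is
 * dirtied, arm a timer for one writeback interval later; the write-out
 * would not start earlier anyway. 0 means "no timer pending"; an
 * already-armed timer is left alone. */
long arm_delayed_wakeup(long pending_wakeup_at, long now, long interval)
{
    return pending_wakeup_at != 0 ? pending_wakeup_at : now + interval;
}

int wakeup_due(long pending_wakeup_at, long now)
{
    return pending_wakeup_at != 0 && now >= pending_wakeup_at;
}
```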
    • writeback: prevent unnecessary bdi threads wakeups · 253c34e9
      Committed by Artem Bityutskiy
      Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
      bad for battery-driven devices.
      
      There are two types of activities bdi threads do:
      1. process bdi works from the 'bdi->work_list'
      2. periodic write-back
      
      So there are 2 sources of wake-up events for bdi threads:
      
      1. 'bdi_queue_work()' - submits bdi works
      2. '__mark_inode_dirty()' - adds dirty I/O to bdi's
      
      The former already has bdi wake-up code. The latter does not, and this patch
      adds it.
      
      '__mark_inode_dirty()' is a hot-path function, but this patch adds another
      'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
      the bdi has no dirty inodes. So adding this spinlock should be fine and should
      not affect performance.
      
      This patch makes sure bdi threads and the forker thread do not wake-up if there
      is nothing to do. The forker thread will nevertheless wake up at least every
      5 min. to check whether it has to kill a bdi thread. This can also be optimized,
      but is not worth it.
      
      This patch also tidies up the warning about an unregistered bdi, and turns
      it from an ugly crocodile to a simple 'WARN()' statement.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      253c34e9