1. 17 Oct, 2007 40 commits
    • M
      fuse: clean up execute permission checking · e8e96157
      Authored by Miklos Szeredi
      Define a new function fuse_refresh_attributes() that conditionally refreshes
      the attributes based on the validity timeout.
      
      In fuse_permission() only refresh the attributes for checking the execute bits
      if necessary.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8e96157
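The conditional refresh described in this commit can be illustrated with a small model. This is only a sketch in Python, not the kernel code: the class `CachedAttrs`, its `fetch` callback and the injectable `clock` are hypothetical names introduced for illustration.

```python
import time


class CachedAttrs:
    """Toy model of attributes cached with a validity timeout (hypothetical names)."""

    def __init__(self, fetch, timeout, clock=time.monotonic):
        self._fetch = fetch        # callable that asks the "filesystem" for attributes
        self._timeout = timeout    # validity period in seconds
        self._clock = clock        # injectable clock, so the model is easy to test
        self._attrs = None
        self._valid_until = None   # None means "never fetched"
        self.fetch_count = 0

    def refresh_if_stale(self):
        """Refresh only when the cached copy has expired; return True if refreshed."""
        now = self._clock()
        if self._valid_until is not None and now < self._valid_until:
            return False
        self._attrs = self._fetch()
        self._valid_until = now + self._timeout
        self.fetch_count += 1
        return True

    def may_execute(self):
        """Check the execute bits, refreshing first only if necessary."""
        self.refresh_if_stale()
        return bool(self._attrs["mode"] & 0o111)
```

Repeated `may_execute()` calls inside the validity window reuse the cached mode; only a call after the timeout triggers another fetch.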
    • M
      fuse: no ENOENT from fuse device read · c9c9d7df
      Authored by Miklos Szeredi
      Don't return -ENOENT for a read() on the fuse device when the request was
      aborted.  Instead return -ENODEV, meaning the filesystem has been
      force-umounted or aborted.
      
      Previously ENOENT meant that the request was interrupted, but now the
      'aborted' flag is not set in case of interrupts.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9c9d7df
    • M
      fuse: no abort on interrupt · a131de0a
      Authored by Miklos Szeredi
      Don't set 'aborted' flag on a request if it's interrupted.  We have to wait
      for the answer anyway, and this would only save a very little time while copying
      the reply.
      
      This means that write() on the fuse device will not return -ENOENT during
      normal operation, only if the filesystem is aborted by a forced umount or
      through the fusectl interface.
      
      This could simplify userspace code somewhat when backward compatibility with
      earlier kernel versions is not required.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a131de0a
    • M
      fuse: cleanup in release · 819c4b3b
      Authored by Miklos Szeredi
      Move dput/mntput pair from request_end() to fuse_release_end(), because
      there's no other place they are used.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      819c4b3b
    • M
      fuse: fix permission checking on sticky directories · ebc14c4d
      Authored by Miklos Szeredi
      The VFS checks sticky bits on the parent directory even if the filesystem
      defines its own ->permission().  In some situations (sshfs, mountlo, etc) the
      user does have permission to delete a file even if the attribute based
      checking would not allow it.
      
      So work around this by storing the permission bits separately and returning
      them in stat(), but cutting the permission bits off from inode->i_mode.
      
      This is slightly hackish, but it's probably not worth it to add new
      infrastructure in VFS and a slight performance penalty for all filesystems,
      just for the sake of fuse.
      
      [Jan Engelhardt] cosmetic fixes
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebc14c4d
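The idea of keeping the reported mode separate from the mode the generic checks see can be sketched as follows. This is a hedged Python model, not the fuse code: `ToyInode`, `orig_mode` and the choice to mask only the sticky bit are assumptions made for illustration; the commit message does not say exactly which bits are masked.

```python
import stat


class ToyInode:
    """Toy inode that reports one mode to stat() and exposes another to generic checks."""

    def __init__(self, mode):
        # Full mode as supplied by the filesystem; this is what stat() reports.
        self.orig_mode = mode
        # Mode visible to the generic sticky-directory check.  Masking only
        # S_ISVTX is an assumption of this sketch.
        self.i_mode = mode & ~stat.S_ISVTX

    def getattr_mode(self):
        """Mode returned to userspace by stat()."""
        return self.orig_mode

    def vfs_sees_sticky(self):
        """Would a generic check on i_mode treat this directory as sticky?"""
        return bool(self.i_mode & stat.S_ISVTX)
```

For a directory created with mode `stat.S_IFDIR | 0o1777`, `getattr_mode()` still shows the sticky bit while `vfs_sees_sticky()` is false, so the decision is left to the filesystem.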
    • M
      fuse: refresh stale attributes in fuse_permission() · 244f6385
      Authored by Miklos Szeredi
      fuse_permission() didn't refresh inode attributes before using them, even if
      the validity has already expired.
      
      Thanks to Junjiro Okajima for spotting this.
      
      Also remove some old code to unconditionally refresh the attributes on the
      root inode.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      244f6385
    • M
      fuse: set i_nlink to sane value after mount · 074406fa
      Authored by Miklos Szeredi
      Aufs seems to depend on a positive i_nlink value.  So fill in a dummy but sane
      value for the root inode at mount time.
      
      The inode attributes are refreshed with the correct values at the first
      opportunity.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      074406fa
    • M
      fuse: fix page invalidation · b1009979
      Authored by Miklos Szeredi
      Other than truncate, there are two cases in which fuse tries to get rid
      of cached pages:
      
       a) in open, if KEEP_CACHE flag is not set
       b) in getattr, if file size changed spontaneously
      
      Until now invalidate_mapping_pages() was used, which didn't get rid
      of mapped pages.  This is wrong, and becomes more wrong as dirty pages
      are introduced.  So instead properly invalidate all pages with
      invalidate_inode_pages2().
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1009979
    • M
      fuse: truncate on spontaneous size change · e00d2c2d
      Authored by Miklos Szeredi
      Memory mappings were only truncated on an explicit truncate, but not when the
      file size was changed externally.
      
      Fix this by moving the truncation code from fuse_setattr to
      fuse_change_attributes.
      
      Yes, there are races between write and external truncation, but we can't
      really do anything about them.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e00d2c2d
    • M
      fuse: add reference counting to fuse_file · c756e0a4
      Authored by Miklos Szeredi
      Make lifetime of 'struct fuse_file' independent from 'struct file' by adding a
      reference counter and destructor.
      
      This will enable asynchronous page writeback, where it cannot be guaranteed
      that the file is not released while a request with this file handle is being
      served.
      
      The actual RELEASE request is only sent when there are no more references to
      the fuse_file.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c756e0a4
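The lifetime rule stated in this commit (RELEASE is only sent once the last reference is dropped) can be sketched with a plain counter. This is an illustrative Python model, not the kernel implementation: `ToyFuseFile` and its `send_release` callback are hypothetical, and the counter is not thread-safe, whereas kernel code would use an atomic count.

```python
class ToyFuseFile:
    """Toy reference-counted handle whose destructor action runs on the last put()."""

    def __init__(self, send_release):
        self._count = 1                    # reference held by the open file
        self._send_release = send_release  # stands in for sending RELEASE

    def get(self):
        """Take an extra reference, e.g. for an in-flight request."""
        if self._count <= 0:
            raise RuntimeError("handle already released")
        self._count += 1
        return self

    def put(self):
        """Drop a reference; the release action runs only when none are left."""
        if self._count <= 0:
            raise RuntimeError("unbalanced put()")
        self._count -= 1
        if self._count == 0:
            self._send_release()
```

A request that takes its own reference with `get()` keeps the handle alive even if the file is closed first; the release action fires on whichever `put()` comes last.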
    • M
      fuse: fix reserved request wake up · de5e3dec
      Authored by Miklos Szeredi
      Use wake_up_all instead of wake_up in put_reserved_req(), otherwise it is
      possible that the right task is not woken up.
      
      Also create a separate reserved_req_waitq in addition to the blocked_waitq,
      since they fulfill totally separate functions.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de5e3dec
    • M
      fuse: update backing_dev_info congestion state · f92b99b9
      Authored by Miklos Szeredi
      Set the read and write congestion state if the request queue is close to
      blocking, and clear it when it's not.
      
      This prevents unnecessary blocking in readahead and (when writable mmaps are
      allowed) writeback.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f92b99b9
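A threshold-based congestion flag of the kind described here might look like the following. This is a sketch in Python under stated assumptions: the limit of 12 queued requests and the 75% threshold are illustrative values, not taken from the commit message, and `ToyQueue` is a hypothetical name.

```python
MAX_BACKGROUND = 12                                # assumed queue limit at which submitters block
CONGESTION_THRESHOLD = MAX_BACKGROUND * 75 // 100  # assumed "close to blocking" mark


class ToyQueue:
    """Toy request queue that flags congestion before it would start blocking."""

    def __init__(self):
        self.num_background = 0
        self.congested = False

    def _update(self):
        # Set the flag when near the limit, clear it when back below the mark.
        self.congested = self.num_background >= CONGESTION_THRESHOLD

    def submit(self):
        self.num_background += 1
        self._update()

    def complete(self):
        self.num_background -= 1
        self._update()

    def would_block(self):
        return self.num_background >= MAX_BACKGROUND
```

Callers such as readahead can consult `congested` and back off early instead of queueing a request that ends up blocking.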
    • J
      floppy: remove register keyword use from floppy driver · fdc1ca8a
      Authored by Jesper Juhl
      The floppy drive is slow.  These days I see absolutely no good reason why the
      floppy driver should try to gain a tiny bit of speed by telling gcc to
      optimize access to some variables via the register keyword.  Better to just
      leave gcc free to do whatever optimizations it deduces to be sane and not
      hamper it by telling it that some variables in the floppy driver are special
      and need to be fast (they don't).
      Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdc1ca8a
    • J
      floppy: remove dead/commented out code from floppy driver · aee9041c
      Authored by Jesper Juhl
      A good initial step for a cleanup seems to me to be getting rid of old dead
      code.  This stuff is either commented out or inside '#if 0' so it is not
      currently in use at all, let's just get rid of it once and for all.  That's a
      few lines less to deal with.
      Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aee9041c
    • J
      floppy: do a very minimal style cleanup of the floppy driver · 06f748c4
      Authored by Jesper Juhl
      Yes, some of this will likely be replaced in later patches, but I do not see
      anyone else coming out of the woodwork with any patches for this driver, so
      I'll ignore comments about churn.  I want to get this driver cleaned up, and
      if I'm going to do so I want to start with this basic style cleanup to reduce
      the reading pain a bit.
      Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06f748c4
    • O
      migration_call(CPU_DEAD): use spin_lock_irq() instead of task_rq_lock() · d2da272a
      Authored by Oleg Nesterov
      Change migration_call(CPU_DEAD) to use direct spin_lock_irq() instead of
      task_rq_lock(rq->idle), since rq->idle can't change its task_rq().
      
      This makes the code a bit more symmetrical with migrate_dead_tasks()'s path
      which uses spin_lock_irq/spin_unlock_irq.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Cliff Wickman <cpw@sgi.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2da272a
    • O
      do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(tasklist) · f7b4cddc
      Authored by Oleg Nesterov
      Currently move_task_off_dead_cpu() is called under
      write_lock_irq(tasklist).  This means it can't use task_lock() which is
      needed to improve migrating to take task's ->cpuset into account.
      
      Change the code to call move_task_off_dead_cpu() with irqs enabled, and
      change migrate_live_tasks() to use read_lock(tasklist).
      
      This is all preparation for the further changes proposed by Cliff Wickman, see
      	http://marc.info/?t=117327786100003
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Cliff Wickman <cpw@sgi.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7b4cddc
    • N
      md: make sure read errors are auto-corrected during a 'check' resync in raid1 · cf7a4416
      Authored by NeilBrown
      Whenever a read error is found, we should attempt to overwrite with correct
      data to 'fix' it.
      
      However, when we do a 'check' pass (which compares data blocks that are
      successfully read, but doesn't normally overwrite) we don't do that.  We
      should.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf7a4416
    • I
      md: expose the degraded status of an assembled array through sysfs · d7f3d291
      Authored by Iustin Pop
      The 'degraded' attribute is useful to quickly determine if the array is
      degraded, instead of parsing 'mdadm -D' output or relying on the other
      techniques (number of working devices against number of defined devices,
      etc.).  The md code already keeps track of this attribute, so it's useful to
      export it.
      Signed-off-by: Iustin Pop <iusty@k1024.org>
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7f3d291
    • N
      md: 'sync_action' in sysfs returns wrong value for readonly arrays · 2b12ab6d
      Authored by NeilBrown
      When an array is started read-only, MD_RECOVERY_NEEDED can be set but no
      recovery will be running.  This causes 'sync_action' to report the wrong
      value.
      
      We could remove the test for MD_RECOVERY_NEEDED, but doing so would leave a
      small gap after requesting a sync action, where 'sync_action' would still
      report the old value.
      
      So make sure that for a read-only array, 'sync_action' always returns 'idle'.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b12ab6d
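The reporting rule from this commit can be written down as a small decision function. This is a Python sketch, not the md code: the function name, its boolean parameters and the default action string are hypothetical stand-ins for the array state described in the message.

```python
def sync_action(read_only, recovery_running, recovery_needed, action="resync"):
    """Toy model of what 'sync_action' reports for a given array state."""
    if read_only:
        # A read-only array never has recovery running, even if it is "needed".
        return "idle"
    if recovery_running or recovery_needed:
        # Keeping the "needed" test avoids the small gap after a sync request
        # during which the old value would still be reported.
        return action
    return "idle"
```

The read-only check comes first, so a pending but not started recovery on a read-only array no longer leaks into the reported value.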
    • N
      md: fix a bug in some never-used code. · 8299d7f7
      Authored by NeilBrown
      http://bugzilla.kernel.org/show_bug.cgi?id=3277
      
      There is a seq_printf here that isn't being passed a 'seq'.  However, as the
      code is inside #ifdef MD_DEBUG, nobody noticed.
      
      Also remove some extra spaces.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8299d7f7
    • A
      bitmap.h: remove dead artifacts · 5ebf2c12
      Authored by Adrian Bunk
      bitmap_active() no longer exists and BITMAP_ACTIVE is no longer used.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ebf2c12
    • M
      md: software Raid autodetect dev list not array · 4d936ec1
      Authored by Michael J. Evans
      In current release kernels the md module (Software RAID) uses a static
      array (dev_t[128]) to store partition/device info temporarily for
      autostart.
      
      I discovered this (and that the devices are added as disks/partitions are
      discovered at boot) while I was debugging why only one of my MD arrays would
      come up whole, while all the others were short a disk.
      
      I eventually discovered that it was enumerating through all of 9 of my 11 hds
      (2 had only 4 partitions apiece) while the other 9 have 15 partitions (I
      wanted 64 per drive...).  The last partition of the 8th drive in my 9 drive
      raid 5 sets wasn't added, thus making the final md array short both a parity
      and data disk, and it was started later, elsewhere.
      
      This patch replaces that static array with a list.
      
      [akpm@linux-foundation.org: removed unused var]
      Signed-off-by: Michael J. Evans <mjevans1983@gmail.com>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d936ec1
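Why a fixed 128-entry array loses devices on a machine like the one described can be shown with a toy comparison. This Python sketch makes two assumptions that are not in the commit message: every partition is treated as an autodetect candidate, and entries beyond the array size are silently dropped. The function names are hypothetical.

```python
MAX_SLOTS = 128  # size of the static dev_t array mentioned in the message


def collect_fixed(devices, max_slots=MAX_SLOTS):
    """Model of a fixed-size table: entries past the limit are dropped (assumed)."""
    slots = []
    for dev in devices:
        if len(slots) < max_slots:
            slots.append(dev)
    return slots


def collect_list(devices):
    """Model of a growable list: every discovered device is kept."""
    return list(devices)


def example_devices():
    """9 drives with 15 partitions plus 2 drives with 4, as in the message."""
    devs = [(drive, part) for drive in range(9) for part in range(15)]
    devs += [(drive, part) for drive in range(9, 11) for part in range(4)]
    return devs
```

With 9 × 15 + 2 × 4 = 143 partitions, the fixed table keeps 128 and loses 15, while the list keeps all of them. Which partitions are lost depends on discovery order, which this sketch does not try to reproduce.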
    • M
      ext2 reservations · a686cd89
      Authored by Martin J. Bligh
      Val's cross-port of the ext3 reservations code into ext2.
      
      [mbligh@mbligh.org: Small type error for printk]
      [akpm@linux-foundation.org: fix types, sync with ext3]
      [mbligh@mbligh.org: Bring ext2 reservations code in line with latest ext3]
      [akpm@linux-foundation.org: kill noisy printk]
      [akpm@linux-foundation.org: remember to dirty the gdp's block]
      [akpm@linux-foundation.org: cross-port the missed 5dea5176]
      [akpm@linux-foundation.org: cross-port e6022603]
      [akpm@linux-foundation.org: Port the omitted 08fb306f]
      [akpm@linux-foundation.org: Backport the missed 20acaa18]
      [akpm@linux-foundation.org: fixes]
      [cmm@us.ibm.com: fix reservation extension]
      [bunk@stusta.de: make ext2_get_blocks() static]
      [hugh@veritas.com: fix hang]
      [hugh@veritas.com: ext2_new_blocks should reset the reservation window size]
      [hugh@veritas.com: ext2 balloc: fix off-by-one against rsv_end]
      [hugh@veritas.com: grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv should treat it as such]
      [hugh@veritas.com: rbtree usage cleanup]
      [pbadari@us.ibm.com: Fix for ext2 reservation]
      [bunk@kernel.org: remove fs/ext2/balloc.c:reserve_blocks()]
      [hugh@veritas.com: ext2 balloc: use io_error label]
      Cc: "Martin J. Bligh" <mbligh@mbligh.org>
      Cc: Valerie Henson <val_henson@linux.intel.com>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a686cd89
    • F
      writeback: remove unnecessary wait in throttle_vm_writeout() · 369f2389
      Authored by Fengguang Wu
      We don't want to introduce pointless delays in throttle_vm_writeout() when
      the writeback limits are not yet exceeded, do we?
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Pete Zaitcev <zaitcev@redhat.com>
      Cc: Greg KH <greg@kroah.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      369f2389
    • J
      introduce I_SYNC · 1c0eeaf5
      Authored by Joern Engel
      I_LOCK was used for several unrelated purposes, which caused deadlock
      situations in certain filesystems as a side effect.  One of the purposes
      now uses the new I_SYNC bit.
      
      Also document the various bits and change their order from historical to
      logical.
      
      [bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
      Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: David Chinner <dgc@sgi.com>
      Cc: Anton Altaparmakov <aia21@cam.ac.uk>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c0eeaf5
    • F
      writeback: introduce writeback_control.more_io to indicate more io · 2e6883bd
      Authored by Fengguang Wu
      After dirtying a 100M file, the normal behavior is to start the writeback
      for all data after a 30s delay.  But sometimes the following happens instead:
      
      	- after 30s:    ~4M
      	- after 5s:     ~4M
      	- after 5s:     all remaining 92M
      
      Some analysis shows that the internal io dispatch queues go like this:
      
      		s_io            s_more_io
      		-------------------------
      	1)	100M,1K         0
      	2)	1K              96M
      	3)	0               96M
      
      1) initial state with a 100M file and a 1K file
      2) 4M written, nr_to_write <= 0, so write more
      3) 1K written, nr_to_write > 0, no more writes (BUG)
      
      nr_to_write > 0 in (3) fools the upper layer into thinking that data have all been
      written out.  The big dirty file is actually still sitting in s_more_io.  We
      cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
      let the loop in generic_sync_sb_inodes() continue: this may starve newly
      expired inodes in s_dirty.  It is also not an option to draw inodes from both
      s_more_io and s_dirty, and let the loop go on: this might lead to live locks,
      and might also starve other superblocks in sync time (well, kupdate may still
      starve some superblocks, that's another bug).
      
      We have to return when a full scan of s_io completes.  So nr_to_write > 0 does
      not necessarily mean that "all data are written".  This patch introduces a
      flag writeback_control.more_io to indicate this situation.  With it the big
      dirty file no longer has to wait for the next kupdate invocation 5s later.
      
      Cc: David Chinner <dgc@sgi.com>
      Cc: Ken Chen <kenchen@google.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e6883bd
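The three states in the table above, and the effect of a more_io flag, can be replayed with a toy model. This Python sketch is not the kernel loop: sizes are in pages, the per-pass budget of 1024 pages (about 4M with 4 KiB pages) is an assumption, and `writeback_pass`, `kupdate_run` and the dictionary layout are hypothetical.

```python
MAX_WRITEBACK_PAGES = 1024  # assumed per-pass budget, ~4M with 4 KiB pages


def writeback_pass(sb, budget=MAX_WRITEBACK_PAGES):
    """One pass over s_io; returns (nr_to_write left, more_io flag)."""
    if not sb["s_io"]:
        # s_io is empty: requeue the parked inodes for another scan.
        sb["s_io"], sb["s_more_io"] = sb["s_more_io"], []
    nr_to_write = budget
    while sb["s_io"] and nr_to_write > 0:
        inode = sb["s_io"].pop(0)
        written = min(inode["dirty"], nr_to_write)
        inode["dirty"] -= written
        nr_to_write -= written
        if inode["dirty"] > 0:
            sb["s_more_io"].append(inode)  # park the partly written inode
    return nr_to_write, bool(sb["s_more_io"])


def kupdate_run(sb, use_more_io):
    """Repeat passes until the caller believes all data has been written."""
    passes = 0
    while True:
        nr_to_write, more_io = writeback_pass(sb)
        passes += 1
        if nr_to_write > 0 and not (use_more_io and more_io):
            return passes


def remaining(sb):
    """Dirty pages still queued on either list."""
    return sum(i["dirty"] for i in sb["s_io"] + sb["s_more_io"])
```

Starting from a 25600-page file and a 1-page file on s_io, the run without the flag stops after two passes with 24576 pages still parked on s_more_io, which corresponds to state (3). With the flag, the caller keeps going until nothing is left.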
    • F
      writeback: remove pages_skipped accounting in __block_write_full_page() · 1f7decf6
      Authored by Fengguang Wu
      Miklos Szeredi <miklos@szeredi.hu> and I identified a writeback bug:
      
      > The following strange behavior can be observed:
      >
      > 1. large file is written
      > 2. after 30 seconds, nr_dirty goes down by 1024
      > 3. then for some time (< 30 sec) nothing happens (disk idle)
      > 4. then nr_dirty again goes down by 1024
      > 5. repeat from 3. until whole file is written
      >
      > So basically a 4Mbyte chunk of the file is written every 30 seconds.
      > I'm quite sure this is not the intended behavior.
      
      It can be reproduced with the following test script:
      
      # cat bin/test-writeback.sh
      grep nr_dirty /proc/vmstat
      echo 1 > /proc/sys/fs/inode_debug
      dd if=/dev/zero of=/var/x bs=1K count=204800&
      while true; do grep nr_dirty /proc/vmstat; sleep 1; done
      
      # bin/test-writeback.sh
      nr_dirty 19207
      nr_dirty 19207
      nr_dirty 30924
      204800+0 records in
      204800+0 records out
      209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
      nr_dirty 47150
      nr_dirty 47141
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47205
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47215
      nr_dirty 47216
      nr_dirty 47216
      nr_dirty 47216
      nr_dirty 47154
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47134
      nr_dirty 47134
      nr_dirty 47135
      nr_dirty 47135
      nr_dirty 47135
      nr_dirty 46097 <== -1038
      nr_dirty 46098
      nr_dirty 46098
      nr_dirty 46098
      [...]
      nr_dirty 46091
      nr_dirty 46092
      nr_dirty 46092
      nr_dirty 45069 <== -1023
      nr_dirty 45056
      nr_dirty 45056
      nr_dirty 45056
      [...]
      nr_dirty 37822
      nr_dirty 36799 <== -1023
      [...]
      nr_dirty 36781
      nr_dirty 35758 <== -1023
      [...]
      nr_dirty 34708
      nr_dirty 33672 <== -1024
      [...]
      nr_dirty 33692
      nr_dirty 32669 <== -1023
      
      % ls -li /var/x
      847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x
      
      % dmesg|grep 847824  # generated by a debug printk
      [  529.263184] redirtied inode 847824 line 548
      [  564.250872] redirtied inode 847824 line 548
      [  594.272797] redirtied inode 847824 line 548
      [  629.231330] redirtied inode 847824 line 548
      [  659.224674] redirtied inode 847824 line 548
      [  689.219890] redirtied inode 847824 line 548
      [  724.226655] redirtied inode 847824 line 548
      [  759.198568] redirtied inode 847824 line 548
      
      # line 548 in fs/fs-writeback.c:
      543                 if (wbc->pages_skipped != pages_skipped) {
      544                         /*
      545                          * writeback is not making progress due to locked
      546                          * buffers.  Skip this inode for now.
      547                          */
      548                         redirty_tail(inode);
      549                 }
      
      Further debugging shows that __block_write_full_page()
      never has the chance to call submit_bh() for that big dirty file:
      the buffer head is *clean*. So basically no page io is issued by
      __block_write_full_page(), hence pages_skipped goes up.
      
      Also the comment in generic_sync_sb_inodes():
      
      544                         /*
      545                          * writeback is not making progress due to locked
      546                          * buffers.  Skip this inode for now.
      547                          */
      
      and the comment in __block_write_full_page():
      
      1713                 /*
      1714                  * The page was marked dirty, but the buffers were
      1715                  * clean.  Someone wrote them back by hand with
      1716                  * ll_rw_block/submit_bh.  A rare case.
      1717                  */
      
      do not quite agree with each other. The page writeback should be skipped for
      'locked buffer', but here it is 'clean buffer'!
      
      This patch fixes this bug. Though I'm not sure why __block_write_full_page()
      is called only to do nothing and who actually issued the writeback for us.
      
      These are the two possible new behaviors after the patch:
      
      1) pretty nice: wait 30s and write ALL:)
      2) not so good:
      	- during the dd: ~16M
      	- after 30s:      ~4M
      	- after 5s:       ~4M
      	- after 5s:     ~176M
      
      The next patch will fix case (2).
      
      Cc: David Chinner <dgc@sgi.com>
      Cc: Ken Chen <kenchen@google.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: David Chinner <dgc@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f7decf6
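The numbers in this report can be cross-checked with a little arithmetic. The Python sketch below assumes 4 KiB pages (consistent with the quoted "1024 pages is about 4Mbyte", but not stated outright) and a flat 30 second interval between chunks; the function name is hypothetical.

```python
def slow_writeback_estimate(file_bytes=204800 * 1024, page_size=4096,
                            chunk_pages=1024, interval_s=30):
    """Estimate how long the buggy one-chunk-per-interval writeback would take."""
    file_pages = file_bytes // page_size
    chunks = -(-file_pages // chunk_pages)  # ceiling division
    return {
        "file_bytes": file_bytes,
        "chunk_bytes": chunk_pages * page_size,
        "file_pages": file_pages,
        "chunks": chunks,
        "total_seconds": chunks * interval_s,
    }
```

Under these assumptions the 209715200-byte test file is 51200 pages, so writing it in 1024-page chunks takes 50 rounds, or roughly 25 minutes at one round every 30 seconds.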
    • F
      writeback: fix ntfs with sb_has_dirty_inodes() · 08d8e974
      Authored by Fengguang Wu
      NTFS's if-condition on dirty inodes is not complete.  Fix it with
      sb_has_dirty_inodes().
      
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Ken Chen <kenchen@google.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08d8e974
    • F
      writeback: fix time ordering of the per superblock inode lists 8 · 2c136579
      Authored by Fengguang Wu
      Streamline the management of dirty inode lists and fix time ordering bugs.
      
      The writeback logic used to move not-yet-expired dirty inodes from s_dirty to
      s_io, *only to* move them back.  The move-inodes-back-and-forth thing is a
      mess, which is eliminated by this patch.
      
      The new scheme is:
      - s_dirty acts as a time ordered io delaying queue;
      - s_io/s_more_io together act as an io dispatching queue.
      
      On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of
      every full scan of s_io.  Otherwise (i.e. for sync/throttle/background
      writeback), we always pull from s_dirty on each run (a partial scan).
      
      Note that the line
      	list_splice_init(&sb->s_more_io, &sb->s_io);
      is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will
      sit in s_io for a long time, preventing newly expired inodes from getting in.
      
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c136579
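The two-queue scheme described in this commit can be modelled in a few lines. This is a Python sketch under assumptions, not the kernel's queue_io(): s_dirty is represented as a list kept oldest-first, entries are `(name, dirtied_when)` tuples, and an inode counts as expired when its timestamp is not later than `older_than`.

```python
def queue_io(sb, older_than=None):
    """Refill s_io: first the parked inodes, then expired ones from s_dirty."""
    # Parked inodes from s_more_io go back onto the dispatch queue first.
    sb["s_io"].extend(sb["s_more_io"])
    sb["s_more_io"].clear()
    # s_dirty is kept oldest-first here (a modelling choice), so the expired
    # inodes form a prefix and the scan can stop at the first unexpired one.
    while sb["s_dirty"]:
        _name, dirtied_when = sb["s_dirty"][0]
        if older_than is not None and dirtied_when > older_than:
            break
        sb["s_io"].append(sb["s_dirty"].pop(0))
```

Passing `older_than=None` models the sync/throttle/background case where everything on s_dirty is pulled; passing a cutoff models kupdate pulling only expired inodes. Inodes that have not expired never leave s_dirty, so nothing has to be moved back.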
    • K
      writeback: fix periodic superblock dirty inode flushing · 0e0f4fc2
      Authored by Ken Chen
      The current -mm tree has a bucketful of bug fixes in the periodic writeback path.
      However, we still hit a glitch where dirty pages on a given inode aren't
      completely flushed to the disk, and the system will accumulate a large amount of
      dirty pages beyond what dirty_expire_interval is designed for.
      
      The problem is that __sync_single_inode() will move an inode to the sb->s_dirty list
      even when there are more pending dirty pages on that inode.  If there is
      another inode with a small number of dirty pages, we hit a case where the loop
      iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0,
      thus leaving behind the inode that has a large amount of dirty pages, and it has
      to wait for another dirty_writeback_interval before we flush it again.  We
      effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
      If the rate of dirtying is sufficiently high, the system will start to
      accumulate a large number of dirty pages.
      
      So fix it by having another sb->s_more_io list on which to park the inode
      while we iterate through sb->s_io and to allow each dirty inode which resides
      on that sb to have an equal chance of flushing some amount of dirty pages.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e0f4fc2
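The accumulation effect described above (at most MAX_WRITEBACK_PAGES written per dirty_writeback_interval) can be made concrete with a small calculation. This Python sketch is illustrative only: the value 1024 for MAX_WRITEBACK_PAGES and the function name are assumptions.

```python
def backlog_after(intervals, dirtied_per_interval, max_writeback_pages=1024, start=0):
    """Dirty pages left after some intervals if only one chunk is written per interval."""
    backlog = start
    for _ in range(intervals):
        backlog += dirtied_per_interval               # pages dirtied during the interval
        backlog -= min(backlog, max_writeback_pages)  # at most one chunk written back
    return backlog
```

When pages are dirtied faster than one chunk per interval, the backlog grows without bound; for example, dirtying 2048 pages per interval leaves 10240 pages queued after 10 intervals, while 512 pages per interval is fully absorbed.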
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 7 · 670e4def
      Authored by Andrew Morton
      This one fixes four bugs.
      
      There are a few situations in there where writeback decides it is going to skip
      over a blockdev inode on the kernel-internal blockdev superblock.  It
      presently does this by moving the blockdev inode onto the tail of the blockdev
      superblock's s_dirty.  But
      
      a) this screws up s_dirty's reverse-time-orderedness and
      
      b) refiling the blockdev for writeback in another 30 seconds is rude.  We
         should try again sooner than that.
      
      Fix all this up by using redirty_head(): move the blockdev inode onto the head
      of the blockdev superblock's s_dirty list for prompt writeback.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      670e4def
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 6 · 65cb9b47
      Authored by Andrew Morton
      Recycling the previous changelog:
      
        When the writeback function is operating in writeback-for-flushing mode
        (as opposed to writeback-for-integrity) and it encounters an I_LOCKed inode,
        it will skip writing that inode.  This is done for throughput and latency:
        move on to another inode rather than blocking for this one.
      
        Writeback skips this inode by moving it off s_io and onto s_dirty, so that
        writeback can proceed with the other inodes on s_io.
      
        However that inode movement can corrupt s_dirty's
        reverse-time-orderedness.  Fix that by using the new redirty_tail(), which
        will update the refiled inode's dirtied_when field.
      
        Note: the behaviour in here is a bit rude: if kupdate happens to come
        across a locked inode then it will defer writeback of that inode for another
        30 seconds.  We'll address that in the next patch.
      
      Address that here.  What we do is to move the skipped inode to the _head_ of
      s_dirty, immediately eligible for writeout again, instead of deferring that
      writeout for another 30 seconds.
      
      One would think that this might cause a livelock: we keep on trying to write
      the same locked inode.  But it won't because:
      
      a) if that was the case, it would _already_ be happening on the
         balance_dirty_pages codepath.  Because balance_dirty_pages() doesn't care
         about inode timestamps.
      
      b) if we skipped this inode then we won't have done any writeback.  The
         higher-level writeback paths will see that wbc.nr_to_write didn't change
         and they'll then back off and take a nap.
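
      Point (b) can be illustrated with a small userspace model.  This is only a
      sketch under stated assumptions: model_wbc, model_inode,
      model_writeback_pass() and model_should_back_off() are hypothetical names
      invented for illustration and are not the kernel's code; only the
      nr_to_write field name comes from the changelog above.

```c
#include <assert.h>

/* Hypothetical stand-in for the writeback control structure. */
struct model_wbc {
    long nr_to_write;   /* pages the caller still wants written */
};

/* Hypothetical stand-in for an inode on s_io. */
struct model_inode {
    int locked;         /* non-zero models an I_LOCKed inode */
    long dirty_pages;   /* pages still waiting for writeback */
};

/*
 * One writeback attempt against a single inode.  A locked inode is
 * skipped, so nr_to_write is left untouched; otherwise we write as many
 * pages as the remaining budget allows.
 */
static void model_writeback_pass(struct model_inode *inode,
                                 struct model_wbc *wbc)
{
    long n;

    assert(wbc->nr_to_write >= 0);
    if (inode->locked)
        return;         /* skipped: no progress is recorded */

    n = inode->dirty_pages < wbc->nr_to_write ?
        inode->dirty_pages : wbc->nr_to_write;
    inode->dirty_pages -= n;
    wbc->nr_to_write -= n;
}

/*
 * The higher-level caller compares the budget it handed out with what is
 * left.  An unchanged budget means nothing was written, so the caller
 * should nap rather than retry the same inode immediately.
 */
static int model_should_back_off(long budget, const struct model_wbc *wbc)
{
    return wbc->nr_to_write == budget;
}
```

      In this model a caller that hands out a budget of 4 pages and meets only a
      locked inode gets nr_to_write back unchanged, so model_should_back_off()
      tells it to sleep instead of spinning on that inode.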
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65cb9b47
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 5 · c6945e77
      Andrew Morton committed
      When the writeback function is operating in writeback-for-flushing mode (as
      opposed to writeback-for-integrity) and it encounters an I_LOCKed inode, it
      will skip writing that inode.  This is done for throughput and latency: move
      on to another inode rather than blocking for this one.
      
      Writeback skips this inode by moving it off s_io and onto s_dirty, so that
      writeback can proceed with the other inodes on s_io.
      
      However that inode movement can corrupt s_dirty's reverse-time-orderedness.
      Fix that by using the new redirty_tail(), which will update the refiled
      inode's dirtied_when field.
      
      Note: the behaviour in here is a bit rude: if kupdate happens to come across a
      locked inode then it will defer writeback of that inode for another 30
      seconds.  We'll address that in the next patch.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6945e77
    • A
      writeback: fix comment, use helper function · 1b43ef91
      Andrew Morton committed
      There's a comment in there which claims that the inode is left on s_io
      if nfs chickened out of writing some data.
      
      But that's not been true for three years.
      9290280ced13c85689adeffa587e9a53bd3a5873 fixed a livelock by moving these
      inodes back onto s_dirty.  Fix the comment.
      
      In the second leg of the `if', use redirty_tail() rather than open-coding it.
      
      Add weaselly comment indicating lack of confidence in the code and lack of the
      fortitude which would be needed to fiddle with it.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b43ef91
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 4 · c986d1e2
      Andrew Morton committed
      When the kupdate function has tried to write back an expired inode it will
      then check to see whether some of the inode's pages are still dirty.
      
      This can happen when the filesystem decided to not write a page for some
      reason.  But it does _not_ occur due to redirtyings: a redirtying will set
      I_DIRTY_PAGES.
      
      What we need to do here is to set I_DIRTY_PAGES to reflect reality and to then
      put the inode onto the _head_ of s_dirty for consideration on the next kupdate
      pass, in five seconds time.
      
      Problem is, the code failed to modify the inode's timestamp when pushing the
      inode onto the head of s_dirty.
      
      The patch:
      
      If there are no other inodes on s_dirty then we leave the inode's timestamp
      alone: it is already expired.
      
      If there _are_ other inodes on s_dirty then we arrange for this inode to get
      the same timestamp as the inode which is at the head of s_dirty, thus
      preserving the s_dirty ordering.  But we only need to do this if this inode
      purports to have been dirtied before the one at head-of-list.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c986d1e2
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 3 · f57b9b7b
      Andrew Morton committed
      While writeback is working against a dirty inode it does a check after trying
      to write some of the inode's pages:
      
      "did the lower layers skip some of the inode's dirty pages because they were
      locked (or under writeback, or whatever)"
      
      If this turns out to be true, we must move the inode back onto s_dirty and
      redirty it.  The reason for doing this is that fsync() and friends only check
      the s_dirty list, and those functions want to know about those pages which
      were locked, so they can be waited upon and, if necessary, rewritten.
      
      Problem is, that redirtying was putting the inode onto the tail of s_dirty
      without updating its timestamp.  This causes a violation of s_dirty ordering.
      
      Fix this by updating inode->dirtied_when when moving the inode onto s_dirty.
      
      But the code is still a bit buggy?  If the inode was _already_ dirty then we
      don't need to move it at all.  Oh well, hopefully it doesn't matter too much,
      as that was a redirtying, which was very recent anyway.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f57b9b7b
    • A
      writeback: fix time ordering of the per superblock dirty inode lists: memory-backed inodes · 9852a0e7
      Andrew Morton committed
      For reasons which escape me, inodes which are dirty against a ram-backed
      filesystem are managed in the same way as inodes which are backed by real
      devices.
      
      Probably we could optimise things here.  But given that we skip the entire
      superblock as soon as we hit the first dirty inode, there's not a lot to be
      gained.
      
      And the code does need to handle one particular non-backed superblock: the
      kernel's fake internal superblock which holds all the blockdevs.
      
      Still.  At present when the code encounters an inode which is dirty against a
      memory-backed filesystem it will skip that inode by refiling it back onto
      s_dirty.  But it fails to update the inode's timestamp when doing so which at
      least makes the debugging code upset.
      
      Fix.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9852a0e7
    • A
      writeback: fix time-ordering of the per-superblock dirty-inode lists · 6610a0bc
      Andrew Morton committed
      When writeback has finished writing back an inode it looks to see if that
      inode is still dirty.  If it is, that means that a process redirtied the inode
      while its writeback was in progress.
      
      What we need to do here is to refile the redirtied inode onto the s_dirty
      list.
      
      But we're doing that wrongly: it could be that this inode was redirtied
      _before_ the last inode on s_dirty.  We're blindly appending this inode to the
      list, after an inode which might be more-recently-dirtied, thus violating the
      list's ordering.
      
      So we must either insertion-sort this inode into the correct place, or we must
      update this inode's dirtied_when field when appending it to the reverse-sorted
      s_dirty list, to preserve the reverse-time-ordering.
      
      This patch does the latter: if this inode was dirtied less recently than the
      tail inode then copy the tail inode's timestamp into this inode.
      
      This means that in rare circumstances, some inodes will be written back later
      than they should have been.  But the time slip will be small.
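
      The timestamp adjustment described above can be sketched as a userspace
      model.  This is an illustration under stated assumptions, not the patch
      itself: model_inode, model_sb, model_redirty_tail() and
      model_s_dirty_is_ordered() are hypothetical names, s_dirty is modelled as a
      plain array rather than a kernel list, and timestamps are compared directly
      without any handling of counter wraparound.

```c
#include <assert.h>
#include <stddef.h>

#define S_DIRTY_MAX 16

/* Hypothetical stand-in for an inode: only the timestamp matters here. */
struct model_inode {
    unsigned long dirtied_when;   /* time at which the inode was dirtied */
};

/*
 * s_dirty modelled as an array.  Index 0 holds the least recently dirtied
 * inode; the last used slot is the "tail", the most recently dirtied one.
 */
struct model_sb {
    struct model_inode *s_dirty[S_DIRTY_MAX];
    size_t nr_dirty;
};

/*
 * Append the inode at the tail.  If it claims to have been dirtied less
 * recently than the current tail, copy the tail's timestamp into it so
 * that the list stays time-ordered.
 */
static void model_redirty_tail(struct model_sb *sb, struct model_inode *inode)
{
    assert(sb->nr_dirty < S_DIRTY_MAX);
    if (sb->nr_dirty > 0) {
        struct model_inode *tail = sb->s_dirty[sb->nr_dirty - 1];

        if (inode->dirtied_when < tail->dirtied_when)
            inode->dirtied_when = tail->dirtied_when;
    }
    sb->s_dirty[sb->nr_dirty++] = inode;
}

/* Returns 1 if timestamps never decrease from index 0 towards the tail. */
static int model_s_dirty_is_ordered(const struct model_sb *sb)
{
    size_t i;

    for (i = 1; i < sb->nr_dirty; i++)
        if (sb->s_dirty[i]->dirtied_when < sb->s_dirty[i - 1]->dirtied_when)
            return 0;
    return 1;
}
```

      In this model, refiling an inode stamped 150 behind a tail stamped 200
      bumps the refiled inode to 200.  That is the small time slip mentioned
      above: the inode now looks younger than it really is, so it may be written
      back slightly later.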
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6610a0bc
    • J
      drivers/char/ip2: fix used-uninit'd bug · 2b0172e1
      Jeff Garzik committed
      Fix bug flagged by a variable-used-uninitialized warning.
      
      [akpm@linux-foundation.org: coding-style]
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b0172e1