1. 16 Sep, 2009 3 commits
    • ext3: Flush disk caches on fsync when needed · 56fcad29
      Committed by Jan Kara
      If we fsync() a file and the inode is not dirty, we don't force a transaction
      to disk and hence don't flush disk caches. Thus file data could be just in disk
      caches and not on persistent storage. Fix the problem by flushing disk caches
      if we didn't force a transaction commit.
      Signed-off-by: Jan Kara <jack@suse.cz>
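The decision described in this commit can be sketched as a small userspace model. This is not the ext3 code: `struct fsync_model`, `model_fsync()` and the `barriers_enabled` flag are hypothetical names, and tying the flush to a barrier setting is an assumption read into the "when needed" of the subject line.

```c
#include <stdbool.h>

/* Hypothetical model of the fsync() decision; not the kernel implementation. */
struct fsync_model {
    bool inode_dirty;       /* inode has changes that need a transaction commit */
    bool barriers_enabled;  /* assumption: a flush is only wanted with barriers on */
    int commits_forced;     /* how many transaction commits fsync forced */
    int cache_flushes;      /* how many explicit disk cache flushes were issued */
};

static void model_fsync(struct fsync_model *m)
{
    if (m->inode_dirty) {
        /* Dirty inode: force the transaction to disk, as before the fix. */
        m->commits_forced++;
        m->inode_dirty = false;
        return;
    }
    /*
     * Clean inode: no commit is forced, so file data may still sit in the
     * disk's volatile cache. Flush the cache explicitly instead.
     */
    if (m->barriers_enabled)
        m->cache_flushes++;
}
```

With a clean inode the model records one cache flush and no commit; with a dirty inode it records a commit and no separate flush.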
    • ext3: Add locking to ext3_do_update_inode · 4f003fd3
      Committed by Chris Mason
      I've been struggling with this off and on while I've been testing the
      data=guarded work.  The symptom is corrupted orphan lists and inodes
      with the wrong i_size stored on disk.  I was convinced the
      data=guarded code was just missing a call to ext3_mark_inode_dirty, but
      tracing showed the i_disksize I was sending to ext3_mark_inode_dirty
      wasn't actually making it to the drive.
      
      ext3_mark_inode_dirty can be called without locks held (atime updates
      and a few others), so the data=guarded code uses locks while updating
      the in-memory inode, and then calls ext3_mark_inode_dirty
      without any locks held.
      
      But, ext3_mark_inode_dirty has no internal locking to make sure that
      only one CPU is updating the buffer head at a time.  Generally this
      works out ok because everyone that changes the inode then calls
      ext3_mark_inode_dirty themselves.  Even though it races, eventually
      someone updates the buffer heads and things move on.
      
      But there is still a risk of the wrong values getting in, and the
      data=guarded code seems to hit the race very often.
      
      Since everyone that changes the inode also logs it, it should be
      possible to fix this with some memory barriers.  I'll leave that as an
      exercise to the reader and lock the buffer head instead.
      
      It is probably a good idea to have a different patch series for lockless
      bit flipping on the ext3 i_state field.  ext3_do_update_inode clears
      EXT3_STATE_NEW with &= without any locks held.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
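The idea of locking the buffer while the in-memory inode is copied into it can be sketched in userspace with a pthread mutex. All names here (`struct buffer_model`, `struct inode_model`, `model_do_update_inode()`) are hypothetical stand-ins, and the mutex only plays the role of the buffer head lock; the real patch works on kernel buffer heads.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical on-disk layout: just two fields for illustration. */
struct raw_inode {
    uint32_t i_size;
    uint32_t i_flags;
};

/* Stand-in for a buffer head: the raw inode bytes plus a lock guarding them. */
struct buffer_model {
    pthread_mutex_t lock;
    struct raw_inode raw;
};

/* Stand-in for the in-memory inode. */
struct inode_model {
    uint32_t i_disksize;
    uint32_t i_flags;
};

/*
 * Copy the in-memory fields into the buffer while holding the buffer lock,
 * so two callers racing on different CPUs cannot interleave their stores
 * and leave a mix of old and new values in the buffer.
 */
static void model_do_update_inode(const struct inode_model *inode,
                                  struct buffer_model *bh)
{
    pthread_mutex_lock(&bh->lock);
    bh->raw.i_size = inode->i_disksize;
    bh->raw.i_flags = inode->i_flags;
    pthread_mutex_unlock(&bh->lock);
}
```

The lock covers the whole copy rather than individual fields, which matches the commit's choice of a simple lock over memory barriers.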
    • ext3: Fix possible deadlock between ext3_truncate() and ext3_get_blocks() · 00171d3c
      Committed by Jan Kara
      During truncate we are sometimes forced to start a new transaction as the
      amount of blocks to be journaled is both quite large and hard to predict. So
      far we restarted a transaction while holding truncate_mutex and that violates
      lock ordering because truncate_mutex ranks below transaction start (and it
      can lead to a real deadlock with ext3_get_blocks() allocating new blocks
      from ext3_writepage()).
      
      Luckily, the problem is easy to fix: We just drop the truncate_mutex before
      restarting the transaction and acquire it afterwards. We are safe to do this as
      by the time ext3_truncate() is called, all the page cache for the truncated
      part of the file is dropped and so writepage() cannot come and allocate new
      blocks in the part of the file we are truncating. The remaining writers are
      stopped by us holding i_mutex.
      Signed-off-by: Jan Kara <jack@suse.cz>
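The lock-ordering fix can be illustrated with a userspace sketch: the mutex is released before the transaction restart and re-acquired afterwards. Everything below is a hypothetical model (`struct truncate_model`, `model_truncate()` and friends are invented names); the `mutex_held` flag exists only so the model can record whether a restart ever happened with the mutex held.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of truncate state; not the ext3 data structures. */
struct truncate_model {
    pthread_mutex_t truncate_mutex;
    bool mutex_held;               /* bookkeeping for the model only */
    int restarts;                  /* transaction restarts performed */
    int restarts_with_mutex_held;  /* lock-ordering violations observed */
};

/* Stand-in for restarting the journal transaction. */
static void model_restart_transaction(struct truncate_model *m)
{
    if (m->mutex_held)
        m->restarts_with_mutex_held++;
    m->restarts++;
}

/* Drop the mutex, restart the transaction, then take the mutex again. */
static void model_truncate_restart(struct truncate_model *m)
{
    m->mutex_held = false;
    pthread_mutex_unlock(&m->truncate_mutex);
    model_restart_transaction(m);
    pthread_mutex_lock(&m->truncate_mutex);
    m->mutex_held = true;
}

/* Truncate that needs restarts_needed transaction restarts along the way. */
static void model_truncate(struct truncate_model *m, int restarts_needed)
{
    pthread_mutex_lock(&m->truncate_mutex);
    m->mutex_held = true;
    for (int i = 0; i < restarts_needed; i++)
        model_truncate_restart(m);
    m->mutex_held = false;
    pthread_mutex_unlock(&m->truncate_mutex);
}
```

In this model no restart ever runs with the mutex held, which is the ordering the commit message asks for.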
  2. 14 Sep, 2009 1 commit
  3. 09 Sep, 2009 1 commit
  4. 24 Aug, 2009 2 commits
    • ext3: Improve error message that changing journaling mode on remount is not possible · 3c4cec65
      Committed by Jan Kara
      This patch makes the error message about changing journaling mode on remount
      more descriptive. Some people are going to hit this error now due to commit
      bbae8bcc if they configure a kernel to default
      to data=writeback mode. The problem happens if they have data=ordered set for
      the root filesystem in /etc/fstab but not in the kernel command line (and they
      don't use initrd). Their filesystem then gets mounted as data=writeback by
      the kernel, but their boot then fails because init scripts won't be able to
      remount the filesystem rw. A better error message will hopefully make it easier
      for them to find the error in their setup and bother us less with error reports :).
      Signed-off-by: Jan Kara <jack@suse.cz>
    • ext3: Update Kconfig description of EXT3_DEFAULTS_TO_ORDERED · 6d418076
      Committed by Theodore Ts'o
      The old description for this configuration option was perhaps not
      completely balanced in terms of describing the tradeoffs of using a
      default of data=writeback vs. data=ordered.  Despite the fact that the old
      description very strongly recommended disabling this feature, all of
      the major distributions have elected to preserve the existing 'legacy'
      default, which is a strong hint that it perhaps wasn't telling the
      whole story.
      
      This revised description has been vetted by a number of ext3
      developers as being better at informing the user about the tradeoffs
      of enabling or disabling this configuration feature.
      
      Cc: linux-ext4@vger.kernel.org
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Jan Kara <jack@suse.cz>
  5. 16 Jul, 2009 2 commits
    • ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() · 43237b54
      Committed by Jan Kara
      Get rid of extenddisksize parameter of ext3_get_blocks_handle(). This seems to
      be a relic from some old days and setting disksize in this function does not
      make much sense. Currently it was set only by ext3_getblk().  Since the
      parameter has some effect only if create == 1, it is easy to check that the
      three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
      ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.
      Signed-off-by: Jan Kara <jack@suse.cz>
    • ext3: Fix truncation of symlinks after failed write · 9eaaa2d5
      Committed by Jan Kara
      Contents of long symlinks are written via standard write methods. So when the
      write fails, we add the inode to the orphan list. But symlinks don't have a
      .truncate method defined, so nobody properly removes them from the orphan list
      (both on disk and in memory).
      
      Fix this by calling ext3_truncate() directly instead of calling vmtruncate()
      (which is saner anyway since we don't need anything vmtruncate() does apart
      from calling .truncate in these paths).  We also add the inode to the orphan
      list only if ext3_can_truncate() is true (currently, it can be false for
      symlinks when there are no blocks allocated) - otherwise orphan list processing
      will complain and ext3_truncate() will not remove the inode from the on-disk
      orphan list.
      Signed-off-by: Jan Kara <jack@suse.cz>
  6. 24 Jun, 2009 2 commits
  7. 19 Jun, 2009 3 commits
  8. 17 Jun, 2009 1 commit
    • ext3: avoid unnecessary spinlock in critical POSIX ACL path · 9c64daff
      Committed by Linus Torvalds
      If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
      to do POSIX ACL checks on any files not owned by the caller, and it does
      this for every single pathname component that it looks up.
      
      That obviously can be pretty expensive if the filesystem isn't careful
      about it, especially with locking. That's doubly sad, since the common
      case tends to be that there are no ACL's associated with the files in
      question.
      
      ext3 already caches the ACL data so that it doesn't have to look it up
      over and over again, but it does so by taking the inode->i_lock spinlock
      on every lookup. Which is a noticeable overhead even if it's a private
      lock, especially on CPU's where the serialization is expensive (eg Intel
      Netburst aka 'P4').
      
      For the special case of not actually having any ACL's, all that locking is
      unnecessary. Even if somebody else were to be changing the ACL's on
      another CPU, we simply don't care - if we've seen a NULL ACL, we might as
      well use it.
      
      So just load the ACL speculatively without any locking, and if it was
      NULL, just use it. If it's non-NULL (either because we had a cached
      entry, or because the cache hasn't been filled in at all), it means that
      we'll need to get the lock and re-load it properly.
      
      This is noticeable even on Nehalem, which does locking quite well (much
      better than P4). From lmbench:
      
      	Processor, Processes - times in microseconds - smaller is better
      	--------------------------------------------------------------------
      	Host                 OS  Mhz null null      open slct fork exec sh
      	                             call  I/O stat clos TCP  proc proc proc
      	--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ----
       - before:
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.45 2.18 69.1 273. 1141
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.48 2.28 69.9 253. 1140
      	nehalem.l Linux 2.6.30- 3193 0.04 0.10 0.95 1.42 2.19 68.6 284. 1141
       - after:
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.44 2.12 68.3 282. 1094
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.20 67.0 308. 1123
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.36 67.4 293. 1148
      
      where you can see what appears to be a roughly 3% improvement in stat
      and open/close latencies from just the removal of the locking overhead.
      
      Of course, this only matters for files you don't own (the owner never
      needs to do the ACL checks), but that's the common case for libraries,
      header files, and executables. As well as for the base components of any
      absolute pathname, even if you are the owner of the final file.
      
      [ At some point we probably want to move this ACL caching logic entirely
        into the VFS layer (and only call down to the filesystem when
        uncached), but in the meantime this improves ext3 a bit.
      
        A similar fix to btrfs makes a much bigger difference (15x improvement
        in lmbench) due to broken caching. ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
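The speculative load described in this commit can be sketched in userspace with C11 atomics and a pthread mutex standing in for the inode spinlock. This is a model under stated assumptions, not the ext3 code: `struct acl`, `struct inode_model`, `model_get_cached_acl()` and the `ACL_NOT_CACHED` sentinel are hypothetical names, and `lock_acquisitions` exists only to make the fast path observable.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ACL object with a reference count. */
struct acl {
    int refcount;
};

/* Hypothetical sentinel meaning "the cache has not been filled in yet". */
#define ACL_NOT_CACHED ((struct acl *)(intptr_t)-1)

struct inode_model {
    pthread_mutex_t lock;              /* stand-in for inode->i_lock */
    _Atomic(struct acl *) cached_acl;  /* NULL, ACL_NOT_CACHED, or a real ACL */
    int lock_acquisitions;             /* model bookkeeping only */
};

static struct acl *model_get_cached_acl(struct inode_model *inode)
{
    /* Speculative load without the lock: a NULL ACL can be used as-is. */
    struct acl *acl = atomic_load(&inode->cached_acl);
    if (acl == NULL)
        return NULL;

    /* Non-NULL: take the lock and re-load properly before taking a reference. */
    pthread_mutex_lock(&inode->lock);
    inode->lock_acquisitions++;
    acl = atomic_load(&inode->cached_acl);
    if (acl != NULL && acl != ACL_NOT_CACHED)
        acl->refcount++;
    pthread_mutex_unlock(&inode->lock);
    return acl;
}
```

A caller that gets `ACL_NOT_CACHED` back would still have to read the ACL from disk and fill the cache; the sketch leaves that part out.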
  9. 12 Jun, 2009 4 commits
  10. 23 May, 2009 1 commit
  11. 18 May, 2009 1 commit
  12. 28 Apr, 2009 1 commit
    • ext3: avoid unnecessary spinlock in critical POSIX ACL path · 96159f25
      Committed by Linus Torvalds
      If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem 
      to do POSIX ACL checks on any files not owned by the caller, and it does 
      this for every single pathname component that it looks up.
      
      That obviously can be pretty expensive if the filesystem isn't careful 
      about it, especially with locking. That's doubly sad, since the common 
      case tends to be that there are no ACL's associated with the files in 
      question.
      
      ext3 already caches the ACL data so that it doesn't have to look it up 
      over and over again, but it does so by taking the inode->i_lock spinlock 
      on every lookup. Which is a noticeable overhead even if it's a private 
      lock, especially on CPU's where the serialization is expensive (eg Intel 
      Netburst aka 'P4').
      
      For the special case of not actually having any ACL's, all that locking is 
      unnecessary. Even if somebody else were to be changing the ACL's on 
      another CPU, we simply don't care - if we've seen a NULL ACL, we might as 
      well use it.
      
      So just load the ACL speculatively without any locking, and if it was 
      NULL, just use it. If it's non-NULL (either because we had a cached 
      entry, or because the cache hasn't been filled in at all), it means that 
      we'll need to get the lock and re-load it properly.
      
      This is noticeable even on Nehalem, which does locking quite well (much 
      better than P4). From lmbench:
      
      	Processor, Processes - times in microseconds - smaller is better
      	--------------------------------------------------------------------
      	Host                 OS  Mhz null null      open slct fork exec sh  
      	                             call  I/O stat clos TCP  proc proc proc
      	--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ----
       - before:
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.45 2.18 69.1 273. 1141
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.48 2.28 69.9 253. 1140
      	nehalem.l Linux 2.6.30- 3193 0.04 0.10 0.95 1.42 2.19 68.6 284. 1141
       - after:
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.44 2.12 68.3 282. 1094
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.20 67.0 308. 1123
      	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.36 67.4 293. 1148
      
      where you can see what appears to be a roughly 3% improvement in stat
      and open/close latencies from just the removal of the locking overhead. 
      
      Of course, this only matters for files you don't own (the owner never 
      needs to do the ACL checks), but that's the common case for libraries, 
      header files, and executables. As well as for the base components of any 
      absolute pathname, even if you are the owner of the final file.
      
      [ At some point we probably want to move this ACL caching logic entirely
        into the VFS layer (and only call down to the filesystem when
        uncached), but in the meantime this improves ext3 a bit.
      
        A similar fix to btrfs makes a much bigger difference (15x improvement
        in lmbench) due to broken caching. ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
  13. 09 Apr, 2009 1 commit
  14. 07 Apr, 2009 1 commit
  15. 03 Apr, 2009 6 commits
  16. 01 Apr, 2009 1 commit
  17. 27 Mar, 2009 1 commit
  18. 26 Mar, 2009 2 commits
  19. 12 Feb, 2009 1 commit
  20. 17 Jan, 2009 1 commit
  21. 10 Jan, 2009 1 commit
    • filesystem freeze: add error handling of write_super_lockfs/unlockfs · c4be0c1d
      Committed by Takashi Sato
      Currently, ext3 in mainline Linux doesn't have a freeze feature which
      suspends write requests.  So, while the filesystem is mounted, we cannot use
      the storage device's features (snapshot and replication) to take a backup
      that keeps the filesystem consistent.
      
      In many cases, a commercial filesystem (e.g. VxFS) has a freeze feature
      that is used to get a consistent backup.
      
      If Linux's standard filesystem ext3 has the freeze feature, we can do it
      without a commercial filesystem.
      
      So I have implemented the ioctls of the freeze feature.
      I think we can take a consistent backup with the following steps.
      1. Freeze the filesystem with the freeze ioctl.
      2. Separate the replication volume or create the snapshot
         with the storage device's feature.
      3. Unfreeze the filesystem with the unfreeze ioctl.
      4. Take the backup from the separated replication volume
         or the snapshot.
      
      This patch:
      
      VFS:
      Changed the type of write_super_lockfs and unlockfs from "void"
      to "int" so that they can return an error.
      Renamed the super block operations write_super_lockfs and unlockfs to
      freeze_fs and unfreeze_fs to avoid confusion.
      
      ext3, ext4, xfs, gfs2, jfs:
      Changed the type of write_super_lockfs and unlockfs from "void"
      to "int" so that write_super_lockfs returns an error if needed,
      and unlockfs always returns 0.
      
      reiserfs:
      Changed the type of write_super_lockfs and unlockfs from "void"
      to "int" so that they always return 0 (success) to keep the current behavior.
      Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
      Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
      Cc: <xfs-masters@oss.sgi.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
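The freeze, snapshot, unfreeze sequence from the commit message can be sketched from userspace. The commit text does not name the ioctls, so this assumes they are the `FIFREEZE` and `FITHAW` requests exposed by `<linux/fs.h>` on current Linux systems. `with_frozen_fs()` and its `snapshot` callback are hypothetical helpers; the callback stands in for whatever storage-level snapshot or replication split is used. Freezing normally requires root privileges.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Freeze the filesystem mounted at mountpoint, run the snapshot callback,
 * then thaw the filesystem again. Returns 0 on success or a negative errno
 * value. A NULL callback means "freeze and thaw only".
 */
static int with_frozen_fs(const char *mountpoint,
                          int (*snapshot)(void *), void *arg)
{
    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0)
        return -errno;

    /* Step 1: freeze. This can now fail, since freezing returns an int. */
    if (ioctl(fd, FIFREEZE, 0) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    /* Step 2: take the snapshot while writes are suspended. */
    int rc = snapshot ? snapshot(arg) : 0;

    /* Step 3: always thaw, and report a thaw failure if nothing failed before. */
    if (ioctl(fd, FITHAW, 0) < 0 && rc == 0)
        rc = -errno;

    close(fd);
    return rc;
}
```

Step 4, taking the backup from the snapshot, happens after this function returns, so the filesystem stays frozen only for the duration of the callback.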
  22. 09 Jan, 2009 3 commits