1. 05 Aug, 2010 1 commit
    • ext4: re-inline ext4_rec_len_(to|from)_disk functions · 0cfc9255
      Eric Sandeen committed
      commit 3d0518f4, "ext4: New rec_len encoding for very
      large blocksizes" made several changes to this path, but from
      a perf perspective, un-inlining ext4_rec_len_from_disk() seems
      most significant.  This function is called from ext4_check_dir_entry(),
      which on a file-creation workload is called extremely often.
      
      I tested this with bonnie:
      
      # bonnie++ -u root -s 0 -f -x 200 -d /mnt/test -n 32
      
      (this does 200 iterations) and got this for the file creations:
      
      ext4 stock:   Average =  21206.8 files/s
      ext4 inlined: Average =  22346.7 files/s  (+5%)
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
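For illustration, a userspace sketch of the kind of small, hot helper this commit moves back to `static inline`. The encoding is an assumption modeled on the rec_len scheme for very large block sizes (record lengths are multiples of 4, so the two low bits can carry bits 16 and 17). The names `rec_len_from_disk`, `rec_len_to_disk`, and `DEMO_MAX_REC_LEN` are hypothetical, endianness conversion is omitted, and this is not the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "record spans a whole 64KB block". */
#define DEMO_MAX_REC_LEN ((1u << 16) - 1)

/* Decode a 16-bit on-disk record length into a byte count. */
static inline unsigned int rec_len_from_disk(uint16_t dlen, unsigned int blocksize)
{
	unsigned int len = dlen;

	if (len == DEMO_MAX_REC_LEN || len == 0)
		return blocksize;
	/* Low two bits hold bits 16-17 of the real length. */
	return (len & 65532u) | ((len & 3u) << 16);
}

/* Encode a byte count (multiple of 4, at most blocksize) into 16 bits. */
static inline uint16_t rec_len_to_disk(unsigned int len, unsigned int blocksize)
{
	assert(len <= blocksize && blocksize <= (1u << 18) && (len & 3u) == 0);

	if (len < 65536u)
		return (uint16_t)len;
	if (len == blocksize)
		return blocksize == 65536u ? (uint16_t)DEMO_MAX_REC_LEN : 0;
	return (uint16_t)((len & 65532u) | ((len >> 16) & 3u));
}
```

Because a helper like this is only a few instructions long, call overhead can dominate when it runs once per directory entry, which is consistent with the gain the commit reports from inlining.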
  2. 27 Jul, 2010 1 commit
  3. 15 Jun, 2010 1 commit
  4. 17 May, 2010 4 commits
  5. 05 Mar, 2010 2 commits
    • dquot: cleanup dquot initialize routine · 871a2931
      Christoph Hellwig committed
      Get rid of the initialize dquot operation - it is now always called from
      the filesystem and if a filesystem really needs its own (which none
      currently does) it can just call into its own routine directly.
      
      Rename the now static low-level dquot_initialize helper to __dquot_initialize
      and vfs_dq_init to dquot_initialize to have a consistent namespace.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
    • dquot: move dquot initialization responsibility into the filesystem · 907f4554
      Christoph Hellwig committed
      Currently various places in the VFS call vfs_dq_init directly.  This means
      we tie the quota code into the VFS.  Get rid of that and make the
      filesystem responsible for the initialization.   For most metadata operations
      this is a straightforward move into the methods, but for truncate and
      open it's a bit more complicated.
      
      For truncate we currently only call vfs_dq_init for the sys_truncate case
      because open already takes care of it for ftruncate and open(O_TRUNC) - the
      new code causes an additional vfs_dq_init for those which is harmless.
      
      For open the initialization is moved from do_filp_open into the open method,
      which means it happens slightly earlier now, and only for regular files.
      The latter is fine because we don't need to initialize it for operations
      on special files, and we already do it as part of the namespace operations
      for directories.
      
      Add a dquot_file_open helper that filesystems that support generic quotas
      can use to fill in ->open.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
  6. 02 Mar, 2010 1 commit
  7. 17 Feb, 2010 1 commit
    • ext4: Fix BUG_ON at fs/buffer.c:652 in no journal mode · 73b50c1c
      Curt Wohlgemuth committed
      Calls to ext4_handle_dirty_metadata should only pass in an inode
      pointer for inode-specific metadata, and not for shared metadata
      blocks such as inode table blocks, block group descriptors, the
      superblock, etc.
      
      The BUG_ON can get tripped when updating a special device (such as a
      block device) that is opened (so that i_mapping is set in
      fs/block_dev.c) and the file system is mounted in no journal mode.
      
      Addresses-Google-Bug: #2404870
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  8. 16 Feb, 2010 1 commit
  9. 09 Dec, 2009 1 commit
  10. 23 Nov, 2009 1 commit
    • ext4: fix potential buffer head leak when add_dirent_to_buf() returns ENOSPC · 2de770a4
      Theodore Ts'o committed
      Previously add_dirent_to_buf() did not free its passed-in buffer head
      in the case of ENOSPC, since in some cases the caller still needed it.
      However, this led to potential buffer head leaks since not all callers
      dealt with this correctly.  Fix this by simplifying the freeing
      convention; now add_dirent_to_buf() *never* frees the passed-in buffer
      head, and leaves that to the responsibility of its caller.  This makes
      things cleaner and makes it easier to prove that the code is neither
      leaking buffer heads nor calling brelse() one time too many.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Curt Wohlgemuth <curtw@google.com>
      Cc: stable@kernel.org
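The ownership convention described in this commit can be sketched in userspace as follows. This is a minimal mock, not the ext4 code: `struct mock_bh`, `mock_brelse`, `mock_add_dirent`, and `mock_add_entry` are hypothetical names, and the reference count stands in for a real buffer head reference.

```c
#include <errno.h>

/* Mock stand-in for a kernel buffer head: only a reference count. */
struct mock_bh {
	int refcount;
};

/* Drop one reference, mirroring the role brelse() plays in the kernel. */
static void mock_brelse(struct mock_bh *bh)
{
	if (bh)
		bh->refcount--;
}

/*
 * Hypothetical callee following the simplified convention: it never
 * releases bh, whether it succeeds or returns -ENOSPC.
 */
static int mock_add_dirent(struct mock_bh *bh, int free_bytes, int needed)
{
	(void)bh;
	return free_bytes >= needed ? 0 : -ENOSPC;
}

/* The caller owns the reference and drops it exactly once on every path. */
static int mock_add_entry(struct mock_bh *bh, int free_bytes, int needed)
{
	int err = mock_add_dirent(bh, free_bytes, needed);

	mock_brelse(bh);
	return err;
}
```

With a single owner, both the success path and the -ENOSPC path end with the count reduced by exactly one, so neither a leak nor a double release can come from the callee.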
  11. 09 Nov, 2009 1 commit
    • ext4: partial revert to fix double brelse WARNING() · 1e424a34
      Theodore Ts'o committed
      This is a partial revert of commit 6487a9d3 (only the changes made to
      fs/ext4/namei.c), since it is causing the following brelse()
      double-free warning when running fsstress on a file system with 1k
      blocksize and we run into a block allocation failure while converting
      a single-block directory to a multi-block hash-tree indexed directory.
      
      WARNING: at fs/buffer.c:1197 __brelse+0x2e/0x33()
      Hardware name: 
      VFS: brelse: Trying to free free buffer
      Modules linked in:
      Pid: 2226, comm: jbd2/sdd-8 Not tainted 2.6.32-rc6-00577-g0003f55 #101
      Call Trace:
       [<c01587fb>] warn_slowpath_common+0x65/0x95
       [<c0158869>] warn_slowpath_fmt+0x29/0x2c
       [<c021168e>] __brelse+0x2e/0x33
       [<c0288a9f>] jbd2_journal_refile_buffer+0x67/0x6c
       [<c028a9ed>] jbd2_journal_commit_transaction+0x319/0x14d8
       [<c0164d73>] ? try_to_del_timer_sync+0x58/0x60
       [<c0175bcc>] ? sched_clock_cpu+0x12a/0x13e
       [<c017f6b4>] ? trace_hardirqs_off+0xb/0xd
       [<c0175c1f>] ? cpu_clock+0x3f/0x5b
       [<c017f6ec>] ? lock_release_holdtime+0x36/0x137
       [<c0664ad0>] ? _spin_unlock_irqrestore+0x44/0x51
       [<c0180af3>] ? trace_hardirqs_on_caller+0x103/0x124
       [<c0180b1f>] ? trace_hardirqs_on+0xb/0xd
       [<c0164d73>] ? try_to_del_timer_sync+0x58/0x60
       [<c0290d1c>] kjournald2+0x11a/0x310
       [<c017118e>] ? autoremove_wake_function+0x0/0x38
       [<c0290c02>] ? kjournald2+0x0/0x310
       [<c0170ee6>] kthread+0x66/0x6b
       [<c0170e80>] ? kthread+0x0/0x6b
       [<c01251b3>] kernel_thread_helper+0x7/0x10
      ---[ end trace 5579351b86af61e3 ]---
      
      Commit 6487a9d3 was an attempt to fix some buffer head leaks in an ENOSPC
      error path, but in some cases it actually results in an excess brelse(),
      as shown above.  Fixing this means moving the responsibility for
      releasing the buffer heads from the callee to the caller of
      add_dirent_to_buf().
      
      Since that's a relatively complex change, and we're late in the rcX
      development cycle, I'm reverting this now, and holding back a more
      complete fix until after 2.6.32 ships.  We've lived with this
      buffer_head leak on ENOSPC in ext3 and ext4 for a very long time; a
      few more months won't kill us.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Curt Wohlgemuth <curtw@google.com>
  12. 29 Sep, 2009 1 commit
  13. 11 Sep, 2009 1 commit
  14. 09 Sep, 2009 1 commit
  15. 30 Aug, 2009 1 commit
  16. 29 Aug, 2009 1 commit
  17. 17 Jul, 2009 1 commit
    • ext4: More buffer head reference leaks · 6487a9d3
      Curt Wohlgemuth committed
      After the patch I posted last week regarding buffer head ref leaks in
      no-journal mode, I looked at all the code that uses buffer heads and
      searched for more potential leaks.
      
      The patch below fixes the issues I found; these can occur even when a
      journal is present.
      
      The change to inode.c fixes a double release if
      ext4_journal_get_create_access() fails.
      
      The changes to namei.c are more complicated.  add_dirent_to_buf() will
      release the input buffer head EXCEPT when it returns -ENOSPC.  There are
      some callers of this routine that don't always do the brelse() in the event
      that -ENOSPC is returned.  Unfortunately, to put this fix into ext4_add_entry()
      required capturing the return value of make_indexed_dir() and
      add_dirent_to_buf().
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  18. 13 Jun, 2009 2 commits
    • ext4: teach the inode allocator to use a goal inode number · 11013911
      Andreas Dilger committed
      Enhance the inode allocator to take a goal inode number as a
      parameter; if it is specified, it takes precedence over Orlov or
      parent directory inode allocation algorithms.
      
      The extents migration function uses the goal inode number so that the
      extent trees allocated by the migration function use the correct flex_bg.
      In the future, the goal inode functionality will also be used to
      allocate an adjacent inode for the extended attributes.
      
      Also, for testing purposes the goal inode number can be specified via
      /sys/fs/{dev}/inode_goal.  This can be useful for testing inode
      allocation beyond 2^32 blocks on very large filesystems.
      Signed-off-by: Andreas Dilger <adilger@sun.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Use a hash of the topdir directory name for the Orlov parent group · f157a4aa
      Theodore Ts'o committed
      Instead of using a random number to determine the goal parent group for
      the Orlov top directories, use a hash of the directory name.  This
      allows for repeatable results when trying to benchmark filesystem
      layout algorithms.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
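A minimal sketch of why hashing the name gives repeatable placement. FNV-1a is swapped in here purely as a stand-in; it is not the directory hash ext4 uses, and `name_hash` and `goal_group_for_topdir` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, used only as a stand-in for the filesystem's own name hash. */
static uint32_t name_hash(const char *name, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= (unsigned char)name[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Pick the goal parent group from the directory name instead of a random
 * number, so the same name always maps to the same group.
 * ngroups must be nonzero.
 */
static uint32_t goal_group_for_topdir(const char *name, size_t len,
				      uint32_t ngroups)
{
	return name_hash(name, len) % ngroups;
}
```

Two benchmark runs that create the same top-level directories on filesystems with the same group count would then start from the same goal groups, which is the repeatability the commit is after.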
  19. 09 Jun, 2009 1 commit
    • ext4: fix dx_map_entry to support 256k directory blocks · 9aee2286
      Toshiyuki Okajima committed
      The dx_map_entry structure does not support block sizes over 64KB
      because of how its "offs" member is currently used: "offs" stores the
      offset of an ext4_dir_entry_2 copy as is, and the member is only
      16 bits wide.  A real offset within a block larger than 64KB (up to
      256KB) needs 18 bits.  However, real offsets are always aligned to a
      4-byte boundary, so the lower 2 bits are never used.
      
      Therefore, we do the following to fix this limitation:
      For "store":
      	we divide the real offset by 4 and then store the result in the
      	"offs" member.
      For "use":
      	we multiply the "offs" member by 4 and then use the result
      	as the real offset.
      Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
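The store/use scheme above can be sketched in a few lines. `struct mini_map_entry`, `map_store_offset`, and `map_load_offset` are hypothetical names for illustration; only the divide-by-4 and multiply-by-4 steps come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical miniature of a map entry with a 16-bit offset field. */
struct mini_map_entry {
	uint32_t hash;
	uint16_t offs;	/* real byte offset divided by 4 */
};

/* "store": keep offset / 4 so an 18-bit offset fits in 16 bits. */
static void map_store_offset(struct mini_map_entry *e, uint32_t byte_offset)
{
	assert((byte_offset & 3u) == 0);	/* entries are 4-byte aligned */
	assert(byte_offset < (1u << 18));	/* fits within a 256KB block */
	e->offs = (uint16_t)(byte_offset >> 2);
}

/* "use": multiply by 4 to recover the real byte offset. */
static uint32_t map_load_offset(const struct mini_map_entry *e)
{
	return (uint32_t)e->offs << 2;
}
```

The largest aligned offset in a 256KB block, 262140, becomes 65535 when stored, which is exactly the top of the 16-bit range.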
  20. 03 May, 2009 1 commit
  21. 02 May, 2009 1 commit
  22. 26 Apr, 2009 1 commit
  23. 26 Mar, 2009 1 commit
  24. 17 Mar, 2009 1 commit
    • ext4: Add auto_da_alloc mount option · afd4672d
      Theodore Ts'o committed
      Add a mount option which allows the user to disable the automatic
      allocation of blocks pending delayed allocation when the file was
      originally truncated or when the file is renamed over an existing
      file.  This feature is intended to save users from the effects of
      naive application writers, but it reduces the effectiveness of the
      delayed allocation code.  This mount option disables this safety
      feature, which may be desirable for production systems where the
      risk of unclean shutdowns or unexpected system crashes is low.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  25. 24 Feb, 2009 1 commit
    • ext4: Automatically allocate delay allocated blocks on rename · 8750c6d5
      Theodore Ts'o committed
      When renaming a file such that a link to another inode is overwritten,
      force any delay allocated blocks to be allocated so that if the
      filesystem is mounted with data=ordered, the data blocks will be
      pushed out to disk along with the journal commit.  Many application
      programs expect this, so we do this to avoid zero length files if the
      system crashes unexpectedly.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  26. 23 Feb, 2009 1 commit
    • ext4: return -EIO not -ESTALE on directory traversal through deleted inode · e6f009b0
      Bryan Donlan committed
      ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to
      report errors to NFS properly.  However, in ext4_lookup(), this
      -ESTALE can be propagated to userspace if the filesystem is corrupted
      such that a directory entry references a deleted inode.  This leads to
      a misleading error message - "Stale NFS file handle" - and confusion
      on the part of the admin.
      
      The bug can be easily reproduced by creating a new filesystem, making
      a link to an unused inode using debugfs, then mounting and attempting
      to ls -l said link.
      
      This patch thus changes ext4_lookup to return -EIO if it receives
      -ESTALE from ext4_iget(), as ext4 does for other filesystem metadata
      corruption; and also invokes the appropriate ext*_error functions when
      this case is detected.
      Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
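The error translation described above amounts to a small mapping step in the lookup path. The sketch below shows only that step; `lookup_error_from_iget` is a hypothetical helper name, and the error reporting the patch adds alongside it is omitted.

```c
#include <errno.h>

/*
 * Map the result of an inode read performed during a directory lookup.
 * -ESTALE here means a directory entry points at a deleted inode, which
 * is on-disk corruption rather than a stale NFS handle, so report -EIO.
 * All other values pass through unchanged.
 */
static int lookup_error_from_iget(int iget_err)
{
	if (iget_err == -ESTALE)
		return -EIO;
	return iget_err;
}
```

Callers that reach the inode through a file handle rather than a directory entry would not go through this mapping, so NFS still sees -ESTALE where it is meaningful.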
  27. 15 Feb, 2009 2 commits
  28. 17 Jan, 2009 1 commit
  29. 09 Jan, 2009 1 commit
  30. 05 Jan, 2009 1 commit
    • fs: symlink write_begin allocation context fix · 54566b2c
      Nick Piggin committed
      With the write_begin/write_end aops, page_symlink was broken because it
      could no longer pass a GFP_NOFS type mask into the point where the
      allocations happened.  They are done in write_begin, which would always
      assume that the filesystem can be entered from reclaim.  This bug could
      cause filesystem deadlocks.
      
      The funny thing with having a gfp_t mask there is that it doesn't really
      allow the caller to arbitrarily tinker with the context in which it can be
      called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
      take the page lock.  The only thing any callers care about is __GFP_FS
      anyway, so turn that into a single flag.
      
      Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
      this flag in their write_begin function.  Change __grab_cache_page to
      accept a nofs argument as well, to honour that flag (while we're there,
      change the name to grab_cache_page_write_begin which is more instructive
      and does away with random leading underscores).
      
      This is really a more flexible way to go in the end anyway -- if a
      filesystem happens to want any extra allocations aside from the pagecache
      ones in its write_begin function, it may now use GFP_KERNEL (rather than
      GFP_NOFS) for common case allocations (eg.  ocfs2_alloc_write_ctxt, for a
      random example).
      
      [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
      [kosaki.motohiro@jp.fujitsu.com: fix fuse]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ Cleaned up the calling convention: just pass in the AOP flags
        untouched to the grab_cache_page_write_begin() function.  That
        just simplifies everybody, and may even allow future expansion of the
        logic.   - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
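The idea of replacing a free-form gfp mask with a single "no filesystem re-entry" flag can be sketched as below. All `DEMO_` constants and `demo_write_begin_gfp` are hypothetical; the bit values are made up for the example and do not match the kernel's definitions.

```c
/* Made-up bit values for illustration only. */
#define DEMO_GFP_IO		0x1u	/* allocation may start I/O */
#define DEMO_GFP_FS		0x2u	/* allocation may re-enter the filesystem */
#define DEMO_GFP_KERNEL		(DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_AOP_FLAG_NOFS	0x4u	/* caller cannot tolerate fs re-entry */

/*
 * Choose the allocation mask for a write_begin-style page allocation:
 * start from the permissive default and clear only the FS bit when the
 * caller passed the NOFS flag.
 */
static unsigned int demo_write_begin_gfp(unsigned int aop_flags)
{
	unsigned int gfp = DEMO_GFP_KERNEL;

	if (aop_flags & DEMO_AOP_FLAG_NOFS)
		gfp &= ~DEMO_GFP_FS;
	return gfp;
}
```

A caller such as a symlink writer that holds filesystem locks would pass the NOFS flag, while ordinary writes keep the permissive default.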
  31. 01 Jan, 2009 1 commit
  32. 05 Nov, 2008 1 commit
    • ext4: Change unsigned long to unsigned int · 498e5f24
      Theodore Ts'o committed
      Convert the unsigned longs that are most responsible for bloating the
      stack usage on 64-bit systems.
      
      Nearly every place in the ext3/4 code which uses "unsigned long" is
      probably a bug, since on 32-bit systems a ulong is 32 bits, which means
      we are wasting stack space on 64-bit systems.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
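The stack-usage argument can be checked with a small size comparison. The two structs below are hypothetical stand-ins for a function's local variables; the member names are illustrative.

```c
#include <stddef.h>

/* Four locals declared as unsigned long versus unsigned int. */
struct locals_ulong { unsigned long group, offset, count, flags; };
struct locals_uint  { unsigned int  group, offset, count, flags; };

/*
 * Bytes saved by the narrower type.  On a typical LP64 target (8-byte
 * long, 4-byte int) this is 16; where long and int have the same width,
 * as on common 32-bit targets, it is 0.
 */
static size_t bytes_saved(void)
{
	return sizeof(struct locals_ulong) - sizeof(struct locals_uint);
}
```

Since values that fit in 32 bits on a 32-bit system gain nothing from the wider type, the extra bytes on 64-bit systems are pure overhead.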
  33. 07 Jan, 2009 1 commit
    • ext4: Allow ext4 to run without a journal · 0390131b
      Frank Mayhar committed
      A few weeks ago I posted a patch for discussion that allowed ext4 to run
      without a journal.  Since that time I've integrated the excellent
      comments from Andreas and fixed several serious bugs.  We're currently
      running with this patch and generating some performance numbers against
      both ext2 (with backported reservations code) and ext4 with and without
      a journal.  It just so happens that running without a journal is
      slightly faster for most everything.
      
      We did
      	iozone -T -t 4 -s 2g -r 256k -T -I -i0 -i1 -i2
      
      which creates 4 threads, each of which creates and does reads and writes on
      a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
      to bypass the page cache.  Results:
      
                           ext2        ext4, default   ext4, no journal
        initial writes   13.0 MB/s        15.4 MB/s          15.7 MB/s
        rewrites         13.1 MB/s        15.6 MB/s          15.9 MB/s
        reads            15.2 MB/s        16.9 MB/s          17.2 MB/s
        re-reads         15.3 MB/s        16.9 MB/s          17.2 MB/s
        random readers    5.6 MB/s         5.6 MB/s           5.7 MB/s
        random writers    5.1 MB/s         5.3 MB/s           5.4 MB/s 
      
      So it seems that, so far, this was a useful exercise.
      Signed-off-by: Frank Mayhar <fmayhar@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  34. 07 Dec, 2008 1 commit