1. 09 Sep, 2008 3 commits
  2. 10 Oct, 2008 1 commit
  3. 03 Aug, 2008 1 commit
  4. 01 Aug, 2008 1 commit
    • [PATCH] fix races and leaks in vfs_quota_on() users · 77e69dac
      Al Viro committed
      * new helper: vfs_quota_on_path(); equivalent of vfs_quota_on() sans the
        pathname resolution.
      * callers of vfs_quota_on() that do their own pathname resolution and
        checks based on it are switched to vfs_quota_on_path(); that way we
        avoid the races.
      * reiserfs leaked dentry/vfsmount references on several failure exits.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      77e69dac
  5. 29 Jul, 2008 1 commit
    • vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a
      Hisashi Hifumi committed
      When we read part of a file through the pagecache and a page exists at
      the corresponding index but is not uptodate, read IO is issued to bring
      the page uptodate.
      
      I think this is fine for a pagesize == blocksize environment, but there
      is room for improvement in a pagesize != blocksize environment, because
      in that case a page can have multiple buffers, and some of them can be
      uptodate even if the page as a whole is not.
      
      So I suggest that when all buffers corresponding to the part of the file
      we want to read are uptodate, we use this page and copy data from it to
      the user buffer even if the page is not uptodate.  This can reduce read
      IO and improve system throughput.
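
To illustrate the proposal above, here is a minimal user-space sketch, not the kernel implementation. It models a page as a set of per-block uptodate flags and decides whether a read range can be served from the cache without IO. The struct, the function name, and the 4096/1024 byte sizes are assumptions chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES  4096u  /* assumed page size */
#define BLOCK_SIZE_BYTES 1024u  /* assumed filesystem block size */
#define BLOCKS_PER_PAGE  (PAGE_SIZE_BYTES / BLOCK_SIZE_BYTES)

/* Hypothetical model of one cached page: one uptodate flag per block buffer. */
struct cached_page {
	bool page_uptodate;
	bool block_uptodate[BLOCKS_PER_PAGE];
};

/*
 * Return true if the byte range [from, from + count) inside the page can be
 * copied to the user buffer without issuing read IO.
 */
bool range_is_uptodate(const struct cached_page *page, size_t from, size_t count)
{
	size_t first, last, i;

	/* Reject empty ranges and ranges that do not fit in one page. */
	if (count == 0 || from >= PAGE_SIZE_BYTES || count > PAGE_SIZE_BYTES - from)
		return false;
	if (page->page_uptodate)
		return true;

	/* The page is not uptodate: check only the blocks the range touches. */
	first = from / BLOCK_SIZE_BYTES;
	last = (from + count - 1) / BLOCK_SIZE_BYTES;
	for (i = first; i <= last; i++)
		if (!page->block_uptodate[i])
			return false;
	return true;
}
```

With blocks 0, 1 and 3 uptodate and block 2 stale, a read of bytes 0..2047 can be served from the cache, while a read of bytes 1024..3071 still needs IO.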
      
      I wrote a benchmark program and measured the following results with it.
      
      The benchmark does the following:
      
        1: mount and open a test file.
      
        2: create a 512MB file.
      
        3: close the file and umount.
      
        4: mount again and reopen the test file.
      
        5: pwrite randomly 300000 times on the test file.  The offset is
           aligned to the IO size (1024 bytes).
      
        6: measure the time taken to pread randomly 100000 times on the
           test file.
      
      The result was:
      
        2.6.26:          330 sec
        2.6.26-patched:  226 sec
      
      Arch: i386
      Filesystem: ext3
      Blocksize: 1024 bytes
      Memory: 1GB
      
      On ext3/4, a file is written through buffers (blocks).  So workloads that
      mix random reads and writes, or that do random reads after random writes,
      are optimized by this patch in a pagesize != blocksize environment, as
      this test result shows.
      
      The benchmark program is as follows:
      
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <time.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mount.h>
      
      #define LEN 1024		/* IO size in bytes */
      #define LOOP (1024 * 512)	/* LEN * LOOP = 512MB */
      
      int main(void)
      {
      	unsigned long i, offset, filesize;
      	int fd;
      	char buf[LEN];
      	time_t t1, t2;
      
      	/* Steps 1-3: mount, create a 512MB zero-filled file, umount. */
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount");
      		exit(1);
      	}
      	memset(buf, 0, LEN);
      	/* O_CREAT requires a mode argument. */
      	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC, 0644);
      	if (fd < 0) {
      		perror("cannot open file");
      		exit(1);
      	}
      	for (i = 0; i < LOOP; i++) {
      		if (write(fd, buf, LEN) != LEN) {
      			perror("cannot write");
      			exit(1);
      		}
      	}
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount");
      		exit(1);
      	}
      
      	/* Step 4: mount again and reopen the test file. */
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount");
      		exit(1);
      	}
      	fd = open("/root/test1/testfile", O_RDWR);
      	if (fd < 0) {
      		perror("cannot open file");
      		exit(1);
      	}
      
      	/* Step 5: random writes; the mask aligns each offset to LEN. */
      	filesize = LEN * LOOP;
      	for (i = 0; i < 300000; i++) {
      		offset = (random() % filesize) & (~(LEN - 1));
      		if (pwrite(fd, buf, LEN, offset) != LEN) {
      			perror("cannot pwrite");
      			exit(1);
      		}
      	}
      
      	/* Step 6: time 100000 random reads at LEN-aligned offsets. */
      	printf("start test\n");
      	time(&t1);
      	for (i = 0; i < 100000; i++) {
      		offset = (random() % filesize) & (~(LEN - 1));
      		if (pread(fd, buf, LEN, offset) != LEN) {
      			perror("cannot pread");
      			exit(1);
      		}
      	}
      	time(&t2);
      	printf("%ld sec\n", (long)(t2 - t1));
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount");
      		exit(1);
      	}
      	return 0;
      }
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ab22b9a
  6. 19 Aug, 2008 1 commit
  7. 20 Aug, 2008 9 commits
    • ext4: Initialize writeback_index to 0 when allocating a new inode · 91246c00
      Aneesh Kumar K.V committed
      The write_cache_pages() function uses the mapping->writeback_index as
      the starting index to write out when range_cyclic is set.  Properly
      initialize writeback_index so that we start the writeout at index 0.
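
The effect of the initialization can be sketched with a small model. This is an illustration only: the two structs are simplified, hypothetical stand-ins for the kernel's mapping and writeback-control structures, and `range_start_index` is an assumed field name.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel structures named above. */
struct mapping_model {
	unsigned long writeback_index;   /* where the last cyclic pass stopped */
};

struct wbc_model {
	bool range_cyclic;
	unsigned long range_start_index; /* assumed: explicit start for ranged writeout */
};

/* Pick the first page index a writeout pass would examine. */
unsigned long writeout_start_index(const struct mapping_model *mapping,
				   const struct wbc_model *wbc)
{
	if (wbc->range_cyclic)
		return mapping->writeback_index;
	return wbc->range_start_index;
}

/* Model of new-inode setup with the fix applied: start writeout at index 0. */
void init_new_mapping(struct mapping_model *mapping)
{
	mapping->writeback_index = 0;
}
```

Without the explicit reset, a cyclic pass over a freshly allocated inode would start from whatever value happened to be in `writeback_index`.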
      
      This was found when debugging the small file fragmentation on ext4.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      91246c00
    • ext4: make sure ext4_has_free_blocks returns 0 for ENOSPC · 16eb7295
      Aneesh Kumar K.V committed
      Fix ext4_has_free_blocks() to return 0 when we don't have enough space.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      16eb7295
    • ext4: journal credit fix for the delayed allocation's writepages() function · 525f4ed8
      Mingming Cao committed
      The previous delalloc writepages implementation started a new transaction
      outside of a loop which called get_block() to do the block allocation.
      Since we didn't know exactly how many blocks would need to be allocated,
      the estimate of the journal credits required was very conservative and
      caused many issues.
      
      With the reworked delayed allocation, a new transaction is created for
      each get_block(), thus we don't need to guess how many credits are needed
      for multiple chunks of allocation.  We start every transaction with enough
      credits for inserting a single extent.  When estimating the credits for
      indirect blocks to allocate a chunk of blocks, we need to know the
      number of data blocks to allocate.  We use the total number of reserved
      delalloc datablocks; if that is too big, for non-extent files, we need
      to limit the number of blocks to EXT4_MAX_TRANS_BLOCKS.
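
The capping rule in the last sentence can be sketched as follows. This is a simplified illustration, not the ext4 code: the function name is hypothetical, and the value 64 is a placeholder rather than the real value of the constant named above.

```c
/* Placeholder cap for illustration; not the kernel's actual constant value. */
#define ASSUMED_MAX_TRANS_BLOCKS 64u

/*
 * Return the number of data blocks to base the credit estimate on.
 * Extent-based files use the full reserved count; non-extent files are
 * capped so the indirect-block estimate stays bounded.
 */
unsigned int blocks_for_credit_estimate(unsigned int reserved_data_blocks,
					int uses_extents)
{
	if (!uses_extents && reserved_data_blocks > ASSUMED_MAX_TRANS_BLOCKS)
		return ASSUMED_MAX_TRANS_BLOCKS;
	return reserved_data_blocks;
}
```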
      
      Code cleanup from Aneesh.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      525f4ed8
    • ext4: Rework the ext4_da_writepages() function · a1d6cc56
      Aneesh Kumar K.V committed
      With the below changes we reserve only the credits needed to insert one
      extent resulting from a single call to get_block.  This makes sure we
      don't take too many journal credits during writeout.  We also don't
      limit the pages to write.  That means we loop through the dirty pages
      building the largest possible contiguous block request.  Then we issue a
      single get_block request.  We may get fewer blocks than we requested.  If
      so, we would end up not mapping some of the buffer_heads.  That means
      those buffer_heads are still marked delay.  Later in the writepage
      callback via __mpage_writepage we redirty those pages.
      
      We should also not limit/throttle wbc->nr_to_write in the filesystem
      writepages callback.  That causes wrong behaviour in
      generic_sync_sb_inodes caused by wbc->nr_to_write being <= 0.
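
The "largest possible contiguous block request" step can be sketched in user space as below. This is an illustration under stated assumptions, not the ext4 writepages code: the input is assumed to be a sorted array of dirty logical block numbers, and all names are hypothetical.

```c
#include <stddef.h>

/* A contiguous run of logical blocks to request in one allocation call. */
struct block_run {
	unsigned long start;
	unsigned long len;
};

/*
 * Given a sorted array of dirty logical block numbers, return the longest
 * contiguous run that begins at blocks[*pos] and advance *pos past it.
 * A run with len == 0 means the array is exhausted.
 */
struct block_run next_contiguous_run(const unsigned long *blocks, size_t n,
				     size_t *pos)
{
	struct block_run run = { 0, 0 };

	if (*pos >= n)
		return run;
	run.start = blocks[*pos];
	run.len = 1;
	(*pos)++;
	while (*pos < n && blocks[*pos] == run.start + run.len) {
		run.len++;
		(*pos)++;
	}
	return run;
}

/*
 * If the allocator maps only `mapped` blocks of a run, the remainder stays
 * delayed and would be redirtied for a later pass.
 */
unsigned long still_delayed(struct block_run run, unsigned long mapped)
{
	return mapped >= run.len ? 0 : run.len - mapped;
}
```

For dirty blocks 10, 11, 12, 20, 21, 30 the sketch yields three requests: (10, 3), (20, 2) and (30, 1).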
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      a1d6cc56
    • ext4: journal credits reservation fixes for DIO, fallocate · f3bd1f3f
      Mingming Cao committed
      The DIO and fallocate credit calculation differs from that of writepage,
      as they start a new journal transaction for each call to
      ext4_get_blocks_wrap().  This patch uses the helper function in the DIO
      and fallocate cases, passing a flag indicating that the modified data are
      contiguous, so fewer indirect/index blocks need to be accounted for.
      
      This patch also fixes the journal credit reservation for direct I/O
      (DIO).  Previously the estimated credits for DIO were calculated only for
      non-extent files, which was not enough if the file is extent-based.
      
      Also fixed: fallocate double-counted the credits for modifying the
      superblock.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      f3bd1f3f
    • ext4: journal credits reservation fixes for extent file writepage · ee12b630
      Mingming Cao committed
      This patch modifies the writepage/write_begin credit calculation for
      extent files to use the credits calculation helper function.
      
      The current calculation of how many index/leaf blocks should be
      accounted for is too conservative: it always considers the worst case,
      where the tree level is 5, and in the case of multiple chunk
      allocations, it always assumes no blocks are dirtied in common across
      the allocations.  This patch uses the accurate depth of the inode with
      some extras to calculate the index blocks, and is also less conservative
      in the case of multiple allocation accounting.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      ee12b630
    • ext4: journal credits calculation cleanup and fix for non-extent writepage · a02908f1
      Mingming Cao committed
      When considering how many journal credits are needed for modifying a
      chunk of data, we need to account for the super block, inode block,
      quota blocks and xattr block, indirect/index blocks, and also the group
      bitmap and group descriptor blocks for new allocations (including data
      and indirect/index blocks).  Many places in ext4 do the calculation
      on their own and often miss one or two metadata blocks, and they often
      assume single block allocation and do not consider the case of multiple
      chunks of allocation.
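
The accounting described above amounts to summing one term per metadata category. The sketch below only shows that shape, so that no category is forgotten. It is not the ext4 helper: the struct and function names are hypothetical, and every count is supplied by the caller rather than derived from real filesystem state.

```c
/* Hypothetical per-operation block counts; the caller supplies every value. */
struct credit_inputs {
	unsigned int superblock;
	unsigned int inode_block;
	unsigned int quota_blocks;
	unsigned int xattr_block;
	unsigned int index_blocks;      /* indirect or extent index blocks */
	unsigned int bitmap_blocks;     /* group bitmaps touched by new allocations */
	unsigned int group_desc_blocks; /* group descriptors touched by new allocations */
};

/* Total journal credits: one credit per metadata block that may be dirtied. */
unsigned int total_credits(const struct credit_inputs *in)
{
	return in->superblock + in->inode_block + in->quota_blocks +
	       in->xattr_block + in->index_blocks + in->bitmap_blocks +
	       in->group_desc_blocks;
}
```

Centralizing the sum in one helper is the point of the cleanup: callers fill in the counts for their case instead of re-deriving the list of categories.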
      
      This patch cleans up the current journal credit code and provides
      common helper functions to calculate the journal credits, to be used
      for writepage, writepages, DIO, fallocate, migration, defrag, and for
      both nonextent and extent files.
      
      This patch modifies the writepage/write_begin credit calculation for
      nonextent files to use the new helper function.  It also fixes the
      problem that writepage on nonextent files did not consider the case
      blocksize < pagesize, which could possibly need multiple block
      allocations in a single transaction.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      a02908f1
    • ext4: Fix bug where we return ENOSPC even though we have plenty of inodes · c001077f
      Eric Sandeen committed
      The find_group_flex() function starts with best_flex as the
      parent_fbg_group, which happens to have 0 inodes free.  Some of the
      flex groups searched have free blocks and free inodes, but the
      flex_freeb_ratio is < 10, so they're skipped.  Then when a group is
      compared to the current "best" flex group, it does not have more free
      blocks than "best", so it is skipped as well.
      
      This continues until no flex group with free inodes is found which has
      a proper ratio or which has more free blocks than the "best" group,
      and we're left with a "best" group that has 0 inodes free, and we
      return -ENOSPC.
      
      We fix this by changing the logic so that if the current "best" flex
      group has no inodes free, and the current one does have room, it is
      promoted to the next "best."
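
The selection logic with the fix applied can be sketched as follows. This is a simplified user-space model of the behaviour described above, not the kernel's find_group_flex(): the struct, the function name and the exact comparison against the 10 percent threshold are assumptions.

```c
#include <stddef.h>

/* Hypothetical per-flex-group counters. */
struct flex_group {
	unsigned long free_blocks;
	unsigned long free_inodes;
};

#define MIN_FREE_BLOCK_RATIO 10 /* percent; threshold taken from the text above */

/*
 * Return the index of the chosen flex group, or -1 if none has free inodes.
 * The search starts with the parent's group as "best".
 */
int pick_flex_group(const struct flex_group *g, size_t n, size_t parent,
		    unsigned long blocks_per_flex)
{
	size_t best = parent, i;

	for (i = 0; i < n; i++) {
		unsigned long ratio;

		if (i == parent)
			continue;
		ratio = g[i].free_blocks * 100 / blocks_per_flex;
		/* A group with a healthy ratio and free inodes wins at once. */
		if (ratio >= MIN_FREE_BLOCK_RATIO && g[i].free_inodes > 0)
			return (int)i;
		if (g[i].free_inodes == 0)
			continue;
		/*
		 * The fix: a "best" group with no free inodes yields to any
		 * group that has room, even one with fewer free blocks.
		 */
		if (g[best].free_inodes == 0 ||
		    g[i].free_blocks > g[best].free_blocks)
			best = i;
	}
	return g[best].free_inodes > 0 ? (int)best : -1;
}
```

Without the `g[best].free_inodes == 0` test, a parent group with many free blocks but no free inodes would stay "best" to the end and the caller would report -ENOSPC.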
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      c001077f
    • ext4: don't try to resize if there are no reserved gdt blocks left · 37609fd5
      Josef Bacik committed
      When trying to resize an ext4 fs and you run out of reserved gdt blocks,
      you get an error that doesn't actually tell you what went wrong, it just
      says that the gdb it picked is not correct, which is the case since you
      don't have any reserved gdt blocks left.  This patch adds a check to make
      sure you have reserved gdt blocks to use, and if not prints out a more
      relevant error.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Andreas Dilger <adilger@sun.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      37609fd5
  8. 16 Aug, 2008 1 commit
  9. 20 Aug, 2008 2 commits
  10. 14 Aug, 2008 1 commit
  11. 20 Aug, 2008 1 commit
  12. 27 Jul, 2008 3 commits
  13. 25 Jul, 2008 1 commit
  14. 18 Jul, 2008 1 commit
  15. 02 Aug, 2008 1 commit
  16. 03 Aug, 2008 1 commit
    • ext4: Fix lack of credits BUG() when deleting a badly fragmented inode · bc965ab3
      Theodore Ts'o committed
      The extents codepath for ext4_truncate() requests journal transaction
      credits in very small chunks, requesting only what is needed.  This
      means there may not be enough credits left on the transaction handle
      after ext4_truncate() returns and then when ext4_delete_inode() tries
      to finish up its work, it may not have enough transaction credits,
      causing a BUG() oops in the jbd2 core.
      
      Also, reserve an extra 2 blocks when starting an ext4_delete_inode()
      since we need to update the inode bitmap, as well as update the
      orphaned inode linked list.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      bc965ab3
  17. 02 Aug, 2008 1 commit
    • ext4: Fix ext4_ext_journal_restart() · 0123c939
      Theodore Ts'o committed
      ext4_ext_journal_restart() is a convenience function which checks
      to see if the requested number of credits is present, and if not it
      closes the current transaction and attaches the current handle to a
      new transaction.  Unfortunately, it wasn't properly checking the
      return value from ext4_journal_extend(), so it was starting a new
      transaction when one was not necessary, and returning an error when
      all that was necessary was to restart the handle with a new
      transaction.
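
The intended control flow can be sketched with a small user-space model. This is an illustration only, not the kernel function: the handle struct and the `model_*` helpers are hypothetical, and the model assumes a three-way "extend" result (0 for success, positive when the running transaction has no room, negative for a real error).

```c
/* Hypothetical stand-in for a journal handle. */
struct handle_model {
	int credits;          /* credits left on this handle */
	int transaction_room; /* credits the running transaction can still grant */
	int restarts;         /* times the handle moved to a new transaction */
	int aborted;          /* nonzero simulates a journal error */
};

/* Assumed contract: 0 = extended, 1 = no room (restart needed), < 0 = error. */
int model_extend(struct handle_model *h, int needed)
{
	if (h->aborted)
		return -1;
	if (needed > h->transaction_room)
		return 1;
	h->transaction_room -= needed;
	h->credits += needed;
	return 0;
}

/* Attach the handle to a fresh transaction with `needed` credits. */
int model_restart(struct handle_model *h, int needed)
{
	h->credits = needed;
	h->restarts++;
	return 0;
}

/* Make sure the handle has more than `needed` credits, restarting only if required. */
int ensure_credits(struct handle_model *h, int needed)
{
	int err;

	if (h->credits > needed)
		return 0;
	err = model_extend(h, needed);
	if (err <= 0)
		return err; /* 0: extended in place; < 0: propagate the error */
	return model_restart(h, needed);
}
```

Treating any nonzero result from the extend step as an error reproduces the bug described above: a successful extend would fall through to an unnecessary restart, and a "no room" result would be returned to the caller as a failure.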
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      0123c939
  18. 03 Aug, 2008 1 commit
  19. 27 Jul, 2008 1 commit
    • ext4: don't read inode block if the buffer has a write error · 9c83a923
      Hidehiro Kawai committed
      A transient I/O error can corrupt inode data.  Here is the scenario:
      
      (1) update inode_A in block_B
      (2) pdflush writes out the new inode_A to the filesystem, but the
          write fails with an I/O error; at this point, the BH_Uptodate flag
          of the buffer for block_B is cleared and BH_Write_EIO is set
      (3) create a new inode_C which is located in block_B, and
          __ext4_get_inode_loc() tries to read the on-disk block_B because
          the buffer is not uptodate
      (4) if it reads the on-disk block_B successfully, inode_A is
          overwritten by old data
      
      This patch makes __ext4_get_inode_loc() not read the inode block if the
      buffer has the BH_Write_EIO flag set.  In this case, the buffer should
      have the latest information, so the patch sets the uptodate flag on the
      buffer (this avoids WARN_ON_ONCE() in mark_buffer_dirty().)
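
The decision described above can be modelled in a few lines. This is a user-space sketch, not the kernel code: the struct and function name are hypothetical, and the two booleans stand in for the BH_Uptodate and BH_Write_EIO buffer flags.

```c
#include <stdbool.h>

/* Hypothetical model of a buffer head with the two flags discussed above. */
struct buffer_model {
	bool uptodate;  /* stands in for BH_Uptodate */
	bool write_eio; /* stands in for BH_Write_EIO */
};

/*
 * Return true if the inode block must be read from disk.  A buffer whose
 * last write failed still holds the newest data, so it is marked uptodate
 * and the read is skipped instead of overwriting it with stale disk content.
 */
bool must_read_inode_block(struct buffer_model *bh)
{
	if (bh->write_eio)
		bh->uptodate = true;
	return !bh->uptodate;
}
```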
      
      With this change, error checking needs to test the BH_Write_EIO flag.
      Currently nobody checks write I/O errors on metadata buffers, but that
      will be done in other patches I'm working on.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: sugita <yumiko.sugita.yf@hitachi.com>
      Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      9c83a923
  20. 24 Jul, 2008 3 commits
  21. 03 Aug, 2008 2 commits
    • ext4: lock block groups when initializing · b5f10eed
      Eric Sandeen committed
      I noticed when filling a 1T filesystem with 4 threads using the
      fs_mark benchmark:
      
      fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0
      
      that I occasionally got checksum mismatch errors:
      
      EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935
      
      etc.  I'd reliably get 4-5 of them during the run.
      
      It appears that the problem is likely a race to init the bg's
      when the uninit_bg feature is enabled.
      
      With the patch below, which adds sb_bgl_locking around initialization,
      I was able to complete several runs with no errors or warnings.
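
The pattern of checking and clearing the "uninitialized" state under a per-group lock can be sketched in portable C11. This is an illustration of the idea, not the ext4 patch: the struct, the spinlock built on `atomic_flag`, and the function names are all assumptions standing in for the kernel's block-group lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one block group. */
struct group_model {
	atomic_flag lock;        /* stand-in for the per-group lock */
	bool uninit;             /* group has not been initialized yet */
	unsigned int init_count; /* how many times initialization actually ran */
};

static void group_lock(struct group_model *g)
{
	while (atomic_flag_test_and_set_explicit(&g->lock, memory_order_acquire))
		; /* spin until the lock is free */
}

static void group_unlock(struct group_model *g)
{
	atomic_flag_clear_explicit(&g->lock, memory_order_release);
}

/*
 * Initialize the group at most once.  The uninit test and the update happen
 * under the same lock, so two racing callers cannot both initialize it.
 * Returns true if this call performed the initialization.
 */
bool init_group_once(struct group_model *g)
{
	bool did_init = false;

	group_lock(g);
	if (g->uninit) {
		g->init_count++;
		g->uninit = false;
		did_init = true;
	}
	group_unlock(g);
	return did_init;
}
```

If the test of `uninit` were done outside the lock, two threads could both see the group as uninitialized and both rewrite its bitmap, which is the kind of race the commit message suspects.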
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      b5f10eed
    • ext4: sync up block and inode bitmap reading functions · e29d1cde
      Eric Sandeen committed
      ext4_read_block_bitmap and read_inode_bitmap do essentially
      the same thing, and yet they are structured quite differently.
      I came across this difference while looking at doing bg locking
      during bg initialization.
      
      This patch:
      
      * removes unnecessary casts in the error messages
      * renames read_inode_bitmap to ext4_read_inode_bitmap
      * and more substantially, restructures the inode bitmap
        reading function to be more like the block bitmap counterpart.
      
      The change to the inode bitmap reader simplifies the locking
      to be applied in the next patch.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      e29d1cde
  22. 27 Jul, 2008 1 commit
  23. 03 Aug, 2008 1 commit
  24. 12 Jul, 2008 1 commit
    • ext4: do not set extents feature from the kernel · e4079a11
      Eric Sandeen committed
      We've talked for a while about getting rid of any feature-
      setting from the kernel; this gets rid of the code which would
      set the INCOMPAT_EXTENTS flag on the first file write when mounted
      as ext4[dev].
      
      With this patch, if the extents feature is not already set on disk,
      then mounting as ext4 will fall back to noextents with a warning,
      and if -o extents is explicitly requested, the mount will fail,
      also with a warning.
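
The resulting mount behaviour is a small decision table, sketched below. The enum and function are hypothetical names for illustration, and the model covers only the two inputs mentioned above (the on-disk feature flag and an explicit -o extents request).

```c
#include <stdbool.h>

/* Possible outcomes of resolving the extents option at mount time. */
enum mount_result {
	MOUNT_EXTENTS,        /* feature set on disk: mount with extents */
	MOUNT_NOEXTENTS_WARN, /* feature missing: fall back, with a warning */
	MOUNT_FAIL            /* feature missing but explicitly requested */
};

/* The kernel never sets the feature itself; it only reacts to what is on disk. */
enum mount_result resolve_extents_option(bool feature_on_disk,
					 bool extents_requested_explicitly)
{
	if (feature_on_disk)
		return MOUNT_EXTENTS;
	if (extents_requested_explicitly)
		return MOUNT_FAIL;
	return MOUNT_NOEXTENTS_WARN;
}
```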
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      e4079a11