1. 03 Aug, 2008 1 commit
  2. 27 Jul, 2008 1 commit
    • ext4: don't read inode block if the buffer has a write error · 9c83a923
      Committed by Hidehiro Kawai
      A transient I/O error can corrupt inode data.  Here is the scenario:
      
      (1) update inode_A at the block_B
      (2) pdflush writes out new inode_A to the filesystem, but it results
          in write I/O error, at this point, BH_Uptodate flag of the buffer
          for block_B is cleared and BH_Write_EIO is set
      (3) a new inode_C located at block_B is created, and
          __ext4_get_inode_loc() tries to read on-disk block_B because the
          buffer is not uptodate
      (4) if on-disk block_B is read successfully, inode_A is
          overwritten by the stale on-disk data
      
      This patch makes __ext4_get_inode_loc() skip reading the inode block if
      the buffer has the BH_Write_EIO flag set.  In this case the buffer still
      holds the latest information, so we simply set the uptodate flag on the
      buffer (this also avoids the WARN_ON_ONCE() in mark_buffer_dirty()).
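
      A minimal sketch of the check this adds in __ext4_get_inode_loc()
      (kernel-internal code shown only as an illustration; the surrounding
      logic in the committed patch may differ in detail):

      if (!buffer_uptodate(bh)) {
      	lock_buffer(bh);
      	/*
      	 * If a previous write to this block failed, the in-memory copy
      	 * is newer than what is on disk; mark the buffer uptodate instead
      	 * of issuing a read that would overwrite the latest inode data
      	 * with stale on-disk contents.
      	 */
      	if (buffer_write_io_error(bh))
      		set_buffer_uptodate(bh);
      	if (buffer_uptodate(bh)) {
      		unlock_buffer(bh);	/* nothing to read */
      	} else {
      		/* ... normal path: submit_bh(READ, bh) and wait ... */
      	}
      }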
      
      As a result of this change, error checking needs to test the
      BH_Write_EIO flag.  Currently nobody checks write I/O errors on metadata
      buffers, but that will be done in other patches I'm working on.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: sugita <yumiko.sugita.yf@hitachi.com>
      Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      9c83a923
  3. 29 Jul, 2008 1 commit
    • vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a
      Committed by Hisashi Hifumi
      When we read some part of a file through the pagecache, if there is a
      pagecache page for the corresponding index but it is not uptodate, a
      read I/O is issued and the page becomes uptodate.
      
      I think this is fine for the pagesize == blocksize case, but there is
      room for improvement when pagesize != blocksize, because in that case a
      page can have multiple buffers, and even if the page is not uptodate,
      some of its buffers can be.
      
      So I suggest that when all of the buffers which correspond to the part
      of the file we want to read are uptodate, we use the pagecache and copy
      the data from it to the user buffer even though the page itself is not
      uptodate.  This can reduce read I/O and improve system throughput.
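
      A minimal sketch of the idea (the helper name and signature are
      illustrative, not necessarily what the patch adds): given the page's
      buffers and the byte range a reader wants, report whether every buffer
      covering that range is already uptodate, so the copy can be served from
      the page cache without issuing a read.

      static int buffers_cover_range_uptodate(struct page *page,
      					unsigned from, unsigned count)
      {
      	struct buffer_head *bh, *head;
      	unsigned block_start = 0, to = from + count;
      	int ret = 1;

      	bh = head = page_buffers(page);
      	do {
      		unsigned block_end = block_start + bh->b_size;

      		/* Only buffers overlapping [from, to) matter. */
      		if (block_end > from && block_start < to &&
      		    !buffer_uptodate(bh))
      			ret = 0;

      		block_start = block_end;
      		bh = bh->b_this_page;
      	} while (ret && bh != head);

      	return ret;
      }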
      
      I wrote a benchmark program and obtained the following results with it.
      
      The benchmark does the following:
      
        1: mount and open a test file.
      
        2: create a 512MB file.
      
        3: close the file and umount.
      
        4: mount again and open the test file.
      
        5: pwrite randomly 300000 times on the test file.  The offset is
           aligned to the I/O size (1024 bytes).
      
        6: measure the time of preading randomly 100000 times on the test file.
      
      The result was:
      
      	2.6.26          330 sec
      	2.6.26-patched  226 sec
      
      Arch: i386
      Filesystem: ext3
      Blocksize: 1024 bytes
      Memory: 1GB
      
      On ext3/4, a file is written through buffers/blocks.  So mixed random
      read/write workloads, or random reads after random writes, are optimized
      by this patch when pagesize != blocksize.  This test result shows that.
      
      The benchmark program is as follows:
      
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <time.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mount.h>
      
      #define LEN 1024
      #define LOOP 1024*512 /* 512MB */
      
      int main(void)
      {
      	unsigned long i, offset, filesize;
      	int fd;
      	char buf[LEN];
      	time_t t1, t2;
      
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	memset(buf, 0, LEN);
      	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC, 0644);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      	for (i = 0; i < LOOP; i++)
      		write(fd, buf, LEN);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	fd = open("/root/test1/testfile", O_RDWR);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      
      	filesize = LEN * LOOP;
      	for (i = 0; i < 300000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pwrite(fd, buf, LEN, offset);
      	}
      	printf("start test\n");
      	time(&t1);
      	for (i = 0; i < 100000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pread(fd, buf, LEN, offset);
      	}
      	time(&t2);
      	printf("%ld sec\n", t2-t1);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      	return 0;
      }
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ab22b9a
  4. 12 Jul, 2008 5 commits
    • ext4: delayed allocation i_blocks fix for stat · 3e3398a0
      Committed by Mingming Cao
      Right now i_blocks does not get updated until the blocks are actually
      allocated on disk.  This means that with delayed allocation, right after
      files are copied, "ls -sF" shows the files as taking 0 blocks on disk,
      and "du" also shows the files taking zero space, which is highly
      confusing to the user.
      
      Since delayed allocation already keeps track of the per-inode total
      number of blocks that are subject to delayed allocation, this patch
      fixes this by using that count to adjust the value returned by stat(2).
      When the real block allocation is done, i_blocks is updated; since the
      blocks reserved for delayed allocation are decreased at the same time,
      this keeps the value returned by stat(2) consistent.
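
      A rough sketch of the stat(2) adjustment described above (the field and
      lock names are assumptions for illustration; the committed code may
      differ in detail):

      int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
      		 struct kstat *stat)
      {
      	struct inode *inode = dentry->d_inode;
      	u64 delalloc_blocks;

      	generic_fillattr(inode, stat);

      	/* Blocks reserved for delayed allocation, tracked per inode. */
      	spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
      	delalloc_blocks = EXT4_I(inode)->i_reserved_data_blocks;
      	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);

      	/* stat->blocks is in 512-byte units; reserved blocks are fs blocks. */
      	stat->blocks += delalloc_blocks << (inode->i_blkbits - 9);
      	return 0;
      }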
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      3e3398a0
    • ext4: fix delalloc i_disksize early update issue · 632eaeab
      Committed by Mingming Cao
      Ext4_da_write_end() used walk_page_buffers() with a callback function of
      ext4_bh_unmapped_or_delay() to check if it extended the file size
      without allocating any blocks (since in this case i_disksize needs to be
      updated).  However, this didn't work properly because the buffer head
      has not been marked dirty yet --- this is done later in
      block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to
      always return false.
      
      In addition, walk_page_buffers() checks all of the buffer heads covering
      the page, and the only buffer_head that should be checked is the one
      covering the end of the write.  Otherwise, given a 1k blocksize
      filesystem and a 4k page size, the buffer head covering the first 1k
      stripe of the file could be unmapped (because it was a sparse file), and
      the second or third buffer_head covering that page could be mapped, and
      using walk_page_buffers() would fail in this case since it would stop at
      the first unmapped buffer_head and return true.
      
      The core problem is that walk_page_buffers() was intended to do work in
      a callback function, and a non-zero return value indicated a failure,
      which terminated the walk of the buffer heads covering the page.  It was
      not intended to be used with a boolean function, such as
      ext4_bh_unmapped_or_delay().
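
      A simplified sketch of checking just the buffer_head that covers the end
      of the write (the name and exact structure are illustrative, not the
      committed helper): if that buffer is unmapped or delayed, the write did
      not allocate a block there and the caller must push i_disksize forward.

      static int da_should_update_i_disksize(struct page *page,
      				       unsigned long offset_in_page)
      {
      	struct buffer_head *bh, *head = page_buffers(page);
      	unsigned int blocksize = head->b_size;
      	unsigned long block_start = 0;

      	/* Walk to the single buffer covering the last byte written. */
      	bh = head;
      	while (block_start + blocksize <= offset_in_page) {
      		block_start += blocksize;
      		bh = bh->b_this_page;
      	}

      	/* Unmapped or delayed: no block was allocated by this write. */
      	return !buffer_mapped(bh) || buffer_delay(bh);
      }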
      
      An additional fix from Aneesh protects the i_disksize update against a race with truncate.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      632eaeab
    • ext4: Handle page without buffers in ext4_*_writepage() · f0e6c985
      Committed by Aneesh Kumar K.V
      It can happen that buffers are removed from the page before it gets
      marked dirty and then is passed to writepage().  In writepage() we just
      initialize the buffers and check whether they are mapped and not delayed.
      If they are mapped and not delayed we write the page; otherwise we mark
      them dirty.  With this change we don't do block allocation at all in
      ext4_*_writepage().
      
      writepage() can get called under many conditions, and with a locking
      order of journal_start -> lock_page, we should not try to allocate
      blocks in writepage(), which gets called after taking the page lock.
      writepage() can get called via shrink_page_list even with a journal
      handle which was created for doing an inode update.  For example, when
      doing ext4_da_write_begin we create a journal handle with one credit,
      expecting an i_disksize update for the inode.  But ext4_da_write_begin
      can cause shrink_page_list via _grab_page_cache.  So having a valid
      handle via ext4_journal_current_handle is not a guarantee that we can
      use the handle for block allocation in writepage(), since we shouldn't
      be using credits that had been reserved for other updates; that could
      result in running out of credits when we later update the inode.
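
      A rough sketch of the policy described above (an illustrative helper,
      not the literal committed code): writepage() only proceeds if no buffer
      on the page still needs block allocation; otherwise the page is
      redirtied and left for a path that can start a properly sized
      transaction.

      static int page_needs_allocation(struct page *page)
      {
      	struct buffer_head *bh, *head;

      	if (!page_has_buffers(page))
      		return 1;	/* no buffers yet: treat as needing work */

      	bh = head = page_buffers(page);
      	do {
      		/* An unmapped or delayed buffer still needs a real block. */
      		if (!buffer_mapped(bh) || buffer_delay(bh))
      			return 1;
      		bh = bh->b_this_page;
      	} while (bh != head);

      	return 0;
      }

      A caller would then redirty the page (redirty_page_for_writepage()),
      unlock it, and return instead of writing it when the helper says
      allocation is still needed.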
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      f0e6c985
    • ext4: Add ordered mode support for delalloc · cd1aac32
      Committed by Aneesh Kumar K.V
      This provides a new ordered mode implementation which gets rid of using
      buffer heads to enforce the ordering between a metadata change and the
      related data change.  Instead, in the new ordering mode, it keeps track
      of all of the inodes touched by each transaction on a list, and when
      that transaction is committed, it flushes all of the dirty pages for
      those inodes.  In addition, the new ordered mode reverses the lock
      ordering of the page lock and transaction lock, which provides easier
      support for delayed allocation.
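
      A minimal sketch of the bookkeeping described above (structure and
      function names are illustrative, not the jbd2 interfaces actually
      added): each transaction keeps a list of the inodes it touched, and
      commit writes out their dirty pages before committing the metadata.

      struct txn_inode {
      	struct list_head list;
      	struct inode *inode;
      };

      /* At write time: remember that this transaction dirtied this inode. */
      static void txn_track_inode(struct list_head *txn_inodes,
      			    struct txn_inode *ti, struct inode *inode)
      {
      	ti->inode = inode;
      	list_add(&ti->list, txn_inodes);
      }

      /* At commit time: write out data for every tracked inode first. */
      static int txn_flush_data(struct list_head *txn_inodes)
      {
      	struct txn_inode *ti;
      	int err = 0;

      	list_for_each_entry(ti, txn_inodes, list) {
      		int ret = filemap_fdatawrite(ti->inode->i_mapping);
      		if (ret && !err)
      			err = ret;
      	}
      	return err;
      }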
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      cd1aac32
    • ext4: Invert lock ordering of page_lock and transaction start in delalloc · 61628a3f
      Committed by Mingming Cao
      With the reverse locking, we need to start a transaction before taking
      the page lock, so in ext4_da_writepages() we need to break the write-out
      into chunks, and restart the journal for each chunk to ensure the
      write-out fits in a single transaction.
      
      Updated patch from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      which fixes the delalloc sync hang with journal lock inversion, and addresses
      the performance regression issue.
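
      A rough sketch of the chunked loop (the chunk size, credit calculation,
      and get_block callback name are assumptions; the committed
      ext4_da_writepages() is more involved):

      static int da_writepages_chunked(struct address_space *mapping,
      				 struct writeback_control *wbc)
      {
      	struct inode *inode = mapping->host;
      	long to_write = wbc->nr_to_write;
      	int ret = 0;

      	while (to_write > 0 && !ret) {
      		/* Hypothetical cap on pages handled per transaction. */
      		long chunk = to_write < 1024 ? to_write : 1024;
      		long written;
      		handle_t *handle;

      		/* Start the handle *before* any page lock is taken. */
      		handle = ext4_journal_start(inode,
      				ext4_writepage_trans_blocks(inode));
      		if (IS_ERR(handle))
      			return PTR_ERR(handle);

      		wbc->nr_to_write = chunk;
      		ret = mpage_writepages(mapping, wbc, ext4_da_get_block_write);
      		ext4_journal_stop(handle);

      		written = chunk - wbc->nr_to_write;
      		to_write -= written;
      		if (!written)
      			break;		/* no progress; avoid spinning */
      	}
      	wbc->nr_to_write = to_write;
      	return ret;
      }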
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      61628a3f
  5. 15 Jul, 2008 1 commit
    • ext4: delayed allocation ENOSPC handling · d2a17637
      Committed by Mingming Cao
      This patch does block reservation for delayed allocation, to avoid
      ENOSPC later at page flush time.
      
      Blocks (data and metadata) are reserved at da_write_begin() time, the
      free-blocks counter is updated then, and the number of reserved blocks
      is stored in a per-inode counter.
      
      At writepage time, unused reserved metadata blocks are returned.  At
      unlink/truncate time, reserved blocks are properly released.
      
      An updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      fixes the old allocator's block reservation accounting with delalloc,
      adds a lock to guard the counters, and also fixes the reservation for
      metadata blocks.
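
      A rough sketch of the reservation accounting (the field names, the lock,
      and the helper are assumptions for illustration, not the exact patch):

      static int da_reserve_space(struct inode *inode, int nr_data, int nr_meta)
      {
      	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
      	struct ext4_inode_info *ei = EXT4_I(inode);
      	s64 total = nr_data + nr_meta;

      	/* Fail at write_begin() time rather than at page flush time. */
      	if (percpu_counter_read_positive(&sbi->s_freeblocks_counter) < total)
      		return -ENOSPC;

      	spin_lock(&ei->i_block_reservation_lock);
      	ei->i_reserved_data_blocks += nr_data;
      	ei->i_reserved_meta_blocks += nr_meta;
      	spin_unlock(&ei->i_block_reservation_lock);

      	/* Hide the reserved blocks from the free-blocks counter now. */
      	percpu_counter_sub(&sbi->s_freeblocks_counter, total);
      	return 0;
      }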
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      d2a17637
  6. 12 Jul, 2008 8 commits
  7. 30 Apr, 2008 2 commits
  8. 22 Apr, 2008 1 commit
  9. 17 Apr, 2008 2 commits
  10. 29 Apr, 2008 1 commit
    • ext4: Fix race between migration and mmap write · 267e4db9
      Committed by Aneesh Kumar K.V
      Fail migrate if we allocated new blocks via mmap write.
      
      If we write to holes in the file via mmap, we end up allocating
      new blocks. This block allocation happens without taking inode->i_mutex.
      Since migrate is protected by i_mutex and migrate expects that no
      new blocks get allocated during migrate, fail migrate if new blocks
      get allocated.
      
      We can't take inode->i_mutex in the mmap write path because that
      would result in a locking order violation between i_mutex and mmap_sem.
      Also, adding a separate rw_semaphore for protection is really high overhead
      for a rare operation such as migrate.
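
      A very rough sketch of the kind of check involved (the flag names are
      purely illustrative; the committed patch uses ext4's own inode state
      flags): the mmap allocation path records that new blocks appeared while
      a migration was in flight, and migrate bails out when it sees that.

      #define MIGRATION_IN_PROGRESS		0x1	/* illustrative flags */
      #define BLOCKS_ALLOCATED_UNLOCKED	0x2

      /* Called from the block allocation path reached via mmap writes. */
      static void note_unlocked_allocation(unsigned long *state)
      {
      	if (*state & MIGRATION_IN_PROGRESS)
      		*state |= BLOCKS_ALLOCATED_UNLOCKED;
      }

      /* Called by migrate (under i_mutex) before committing the result. */
      static int migration_must_fail(unsigned long state)
      {
      	return state & BLOCKS_ALLOCATED_UNLOCKED;	/* then bail out */
      }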
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      267e4db9
  11. 26 Feb, 2008 1 commit
    • ext4: Fix BUG when writing to an uninitialized extent · f5ab0d1f
      Committed by Mingming Cao
      This patch fixes a bug when writing to preallocated but uninitialized
      blocks, which resulted in a BUG in fs/buffer.c saying that the buffer
      is not mapped.
      
      When writing to a file, ext4_get_blocks_wrap() is called with create=1 in
      order to request that blocks be allocated if necessary.  It currently
      calls ext4_get_blocks() with create=0 in order to do a lookup first.  If
      the inode contains an uninitialized data block, the buffer head is left
      unmapped, which ext4_get_blocks_wrap() returns, causing the BUG.
      
      We fix this by checking to see if the buffer head is unmapped, and if
      so, we make sure the buffer head is mapped by calling
      ext4_ext_get_blocks() with create=1.
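
      A simplified sketch of the fix (the two helpers below are hypothetical
      stand-ins for the real lookup and allocation calls):

      static int get_block_for_write(handle_t *handle, struct inode *inode,
      			       sector_t block, struct buffer_head *bh)
      {
      	int ret;

      	/* First do a read-only lookup (create=0). */
      	ret = lookup_block_mapping(inode, block, bh);	/* hypothetical */
      	if (ret > 0 && buffer_mapped(bh))
      		return ret;			/* already mapped, done */

      	/*
      	 * The bh came back unmapped: either a hole or a preallocated,
      	 * uninitialized extent.  Retry with create=1 so the extent is
      	 * initialized and the buffer head mapped before fs/buffer.c
      	 * writes through it.
      	 */
      	return allocate_block_mapping(handle, inode, block, bh);  /* hypothetical */
      }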
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      f5ab0d1f
  12. 16 Feb, 2008 1 commit
    • ext4: modify block allocation algorithm for the last group · 74d3487f
      Committed by Valerie Clement
      When a directory inode is allocated in the last group and the last group
      contains fewer than s_blocks_per_group blocks, the initial block
      allocated for the directory is not always allocated in the same group as
      the directory inode, but in one of the first groups of the filesystem
      (group 1, for example).
      Depending on the current process's pid, ext4_find_near() and
      ext4_ext_find_goal() can return a block number greater than the maximum
      block count of the filesystem, and in that case the block will not be
      allocated in the same group as the inode.
      
      The following patch fixes the problem.
      
      Should the modification also be done in ext2/3 code?
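
      As a rough illustration of the bound involved (not the committed patch):
      whatever goal block the heuristics compute, it must stay below the
      filesystem's real block count, otherwise the allocator falls back to an
      unrelated group.

      static ext4_fsblk_t clamp_goal_block(struct super_block *sb,
      				     ext4_fsblk_t goal)
      {
      	ext4_fsblk_t max = ext4_blocks_count(EXT4_SB(sb)->s_es) - 1;

      	return goal > max ? max : goal;
      }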
      Signed-off-by: Valerie Clement <valerie.clement@bull.net>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      74d3487f
  13. 26 Feb, 2008 1 commit
  14. 10 Feb, 2008 1 commit
    • ext4: Fix Direct I/O locking · 7fb5409d
      Committed by Jan Kara
      We cannot start a transaction in ext4_direct_IO() and just let it last
      during the whole write because dio_get_page() acquires mmap_sem which
      ranks above transaction start (e.g. because we have dependency chain
      mmap_sem->PageLock->journal_start, or because we update atime while
      holding mmap_sem) and thus deadlocks could happen. We solve the problem
      by starting a transaction separately for each ext4_get_block() call.
      
      We *could* have a problem where we allocate a block and the machine
      crashes before its data is written out, thus exposing stale data.  But
      that does not happen, because for hole filling the generic code falls
      back to buffered writes, and for file extension we add the inode to the
      orphan list, so in case of a crash, journal replay will truncate the
      inode back to its original size.
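
      A rough sketch of the per-call transaction (the credit count and wrapper
      name are illustrative; the committed code threads the handle through the
      existing get_block path differently):

      static int dio_get_block_journalled(struct inode *inode, sector_t iblock,
      				    struct buffer_head *bh_result, int create)
      {
      	handle_t *handle;
      	int ret;

      	if (!create)
      		return ext4_get_block(inode, iblock, bh_result, 0);

      	/* One short-lived transaction per block allocation. */
      	handle = ext4_journal_start(inode, EXT4_DATA_TRANS_BLOCKS(inode->i_sb));
      	if (IS_ERR(handle))
      		return PTR_ERR(handle);

      	ret = ext4_get_block(inode, iblock, bh_result, 1);
      	ext4_journal_stop(handle);
      	return ret;
      }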
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      7fb5409d
  15. 08 Feb, 2008 1 commit
  16. 07 Feb, 2008 1 commit
  17. 06 Feb, 2008 2 commits
    • allow in-inode EAs on ext4 root inode · 0040d987
      Committed by Eric Sandeen
      The ext3 root inode was treated specially with respect
      to in-inode extended attributes, for reasons detailed
      in the removed comment below.  The first mkfs-created
      inodes would not get extra_i_size or the EXT3_STATE_XATTR
      flag set in ext3_read_inode, which disallowed reading or
      setting in-inode EAs on the root.
      
      However, in ext4, ext4_mark_inode_dirty calls
      ext4_expand_extra_isize for all inodes; once this is done
      EAs may be placed in the root ext4 inode body.
      
      But for the reasons above, it won't be found after a reboot.
      
      testcase:
      
      setfattr -n user.name -v value mntpt/
      setfattr -n user.name2 -v value2 mntpt/
      umount mntpt/; remount mntpt/
      getfattr -d mntpt/
      
      name2/value2 has gone missing; debugfs shows it in the
      inode body, but it is not found there by getattr.
      
      The following fixes it up; newer mkfs appears to properly
      zero the inodes, so this workaround isn't needed for ext4.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      0040d987
    • Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3
      Committed by Christoph Lameter
      Simplify page cache zeroing of segments of pages through three functions:
      
      zero_user_segments(page, start1, end1, start2, end2)
      
              Zeros two segments of the page.  It takes the positions where the
              zeroing starts and ends, which avoids length calculations and
              makes the code clearer.
      
      zero_user_segment(page, start, end)
      
              Same for a single segment.
      
      zero_user(page, start, length)
      
              Length variant for the case where we know the length.
      
      We remove the zero_user_page macro. Issues:
      
      1. It's a macro. Inline functions are preferable.
      
      2. The KM_USER0 macro is only defined for HIGHMEM.
      
         Having to treat this special case everywhere makes the
         code needlessly complex. The parameter for zeroing is always
         KM_USER0 except in one single case that we open code.
      
      Avoiding KM_USER0 means a lot of code no longer has to deal with the
      special casing for HIGHMEM.  Dealing with kmap is only necessary for
      HIGHMEM configurations.  In those configurations we use KM_USER0 as we
      do for a series of other functions defined in highmem.h.
      
      Since KM_USER0 depends on HIGHMEM, the existing zero_user_page function
      could not be a macro.  The zero_user_* functions introduced here can be
      inline because that constant is not used when these functions are
      called.
      
      Also extract the flushing of the caches to be outside of the kmap.
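
      A sketch of the three helpers as they might look with the
      kmap_atomic()/KM_USER0 interface of that era (the in-tree versions, with
      their sanity checks, live in include/linux/highmem.h):

      static inline void zero_user_segments(struct page *page,
      				      unsigned start1, unsigned end1,
      				      unsigned start2, unsigned end2)
      {
      	void *kaddr = kmap_atomic(page, KM_USER0);

      	if (end1 > start1)
      		memset(kaddr + start1, 0, end1 - start1);
      	if (end2 > start2)
      		memset(kaddr + start2, 0, end2 - start2);
      	kunmap_atomic(kaddr, KM_USER0);
      	/* Flushing is done outside the kmap, as the description notes. */
      	flush_dcache_page(page);
      }

      static inline void zero_user_segment(struct page *page,
      				     unsigned start, unsigned end)
      {
      	zero_user_segments(page, start, end, 0, 0);
      }

      static inline void zero_user(struct page *page, unsigned start,
      			     unsigned size)
      {
      	zero_user_segment(page, start, start + size);
      }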
      
      [akpm@linux-foundation.org: fix nfs and ntfs build]
      [akpm@linux-foundation.org: fix ntfs build some more]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: David Chinner <dgc@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eebd2aa3
  18. 29 Jan, 2008 9 commits