1. 29 7月, 2008 1 次提交
    • H
      vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a
      Hisashi Hifumi 提交于
      When we read some part of a file through pagecache, if there is a
      pagecache of corresponding index but this page is not uptodate, read IO
      is issued and this page will be uptodate.
      
      I think this is good for pagesize == blocksize environment but there is
      room for improvement on pagesize != blocksize environment.  Because in
      this case a page can have multiple buffers and even if a page is not
      uptodate, some buffers can be uptodate.
      
      So I suggest that when all buffers which correspond to a part of a file
      that we want to read are uptodate, use this pagecache and copy data from
      this pagecache to user buffer even if a page is not uptodate.  This can
      reduce read IO and improve system throughput.
      
      I wrote a benchmark program and got result number with this program.
      
      This benchmark do:
      
        1: mount and open a test file.
      
        2: create a 512MB file.
      
        3: close a file and umount.
      
        4: mount and again open a test file.
      
        5: pwrite randomly 300000 times on a test file.  offset is aligned
           by IO size(1024bytes).
      
        6: measure time of preading randomly 100000 times on a test file.
      
      The result was:
      	2.6.26
              330 sec
      
      	2.6.26-patched
              226 sec
      
      Arch:i386
      Filesystem:ext3
      Blocksize:1024 bytes
      Memory: 1GB
      
      On ext3/4, a file is written through buffer/block.  So random read/write
      mixed workloads or random read after random write workloads are optimized
      with this patch under pagesize != blocksize environment.  This test result
      showed this.
      
      The benchmark program is as follows:
      
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <time.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mount.h>
      
      #define LEN 1024
      #define LOOP 1024*512 /* 512MB */
      
      main(void)
      {
      	unsigned long i, offset, filesize;
      	int fd;
      	char buf[LEN];
      	time_t t1, t2;
      
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	memset(buf, 0, LEN);
      	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      	for (i = 0; i < LOOP; i++)
      		write(fd, buf, LEN);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	fd = open("/root/test1/testfile", O_RDWR);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      
      	filesize = LEN * LOOP;
      	for (i = 0; i < 300000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pwrite(fd, buf, LEN, offset);
      	}
      	printf("start test\n");
      	time(&t1);
      	for (i = 0; i < 100000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pread(fd, buf, LEN, offset);
      	}
      	time(&t2);
      	printf("%ld sec\n", t2-t1);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      }
      Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ab22b9a
  2. 26 7月, 2008 3 次提交
    • D
      ext3: handle deleting corrupted indirect blocks · 3ccc3167
      Duane Griffin 提交于
      While freeing indirect blocks we attach a journal head to the parent
      buffer head, free the blocks, then journal the parent.  If the indirect
      block list is corrupted and points to the parent the journal head will be
      detached when the block is cleared, causing an OOPS.
      
      Check for that explicitly and handle it gracefully.
      
      This patch fixes the third case (image hdb.20000057.nullderef.gz)
      reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.
      
      Immediately above the change, in the ext3_free_data function, we call
      ext3_clear_blocks to clear the indirect blocks in this parent block.  If
      one of those blocks happens to actually be the parent block it will clear
      b_private / BH_JBD.
      
      I did the check at the end rather than earlier as it seemed more elegant.
      I don't think there should be much practical difference, although it is
      possible the FS may not be quite so badly corrupted if we did it the other
      way (and didn't clear the block at all).  To be honest, I'm not convinced
      there aren't other similar failure modes lurking in this code, although I
      couldn't find any with a quick review.
      
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: NDuane Griffin <duaneg@dghda.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ccc3167
    • H
      ext3: don't read inode block if the buffer has a write error · 95450f5a
      Hidehiro Kawai 提交于
      A transient I/O error can corrupt inode data.  Here is the scenario:
      
      (1) update inode_A at the block_B
      (2) pdflush writes out new inode_A to the filesystem, but it results
          in write I/O error, at this point, BH_Uptodate flag of the buffer
          for block_B is cleared and BH_Write_EIO is set
      (3) create new inode_C which located at block_B, and
          __ext3_get_inode_loc() tries to read on-disk block_B because the
          buffer is not uptodate
      (4) if it can read on-disk block_B successfully, inode_A is
          overwritten by old data
      
      This patch makes __ext3_get_inode_loc() not read the inode block if the
      buffer has BH_Write_EIO flag.  In this case, the buffer should have the
      latest information, so setting the uptodate flag to the buffer (this
      avoids WARN_ON_ONCE() in mark_buffer_dirty().)
      
      According to this change, we would need to test BH_Write_EIO flag for the
      error checking.  Currently nobody checks write I/O errors on metadata
      buffers, but it will be done in other patches I'm working on.
      Signed-off-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: sugita <yumiko.sugita.yf@hitachi.com>
      Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95450f5a
    • D
      ext3: handle corrupted orphan list at mount · ae76dd9a
      Duane Griffin 提交于
      If the orphan node list includes valid, untruncatable nodes with nlink > 0
      the ext3_orphan_cleanup loop which attempts to delete them will not do so,
      causing it to loop forever. Fix by checking for such nodes in the
      ext3_orphan_get function.
      
      This patch fixes the second case (image hdb.20000009.softlockup.gz)
      reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: printk warning fix]
      Signed-off-by: NDuane Griffin <duaneg@dghda.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae76dd9a
  3. 30 4月, 2008 1 次提交
  4. 28 4月, 2008 2 次提交
  5. 22 4月, 2008 1 次提交
  6. 08 2月, 2008 1 次提交
  7. 07 2月, 2008 2 次提交
  8. 06 2月, 2008 1 次提交
    • C
      Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3
      Christoph Lameter 提交于
      Simplify page cache zeroing of segments of pages through 3 functions
      
      zero_user_segments(page, start1, end1, start2, end2)
      
              Zeros two segments of the page. It takes the position where to
              start and end the zeroing which avoids length calculations and
      	makes code clearer.
      
      zero_user_segment(page, start, end)
      
              Same for a single segment.
      
      zero_user(page, start, length)
      
              Length variant for the case where we know the length.
      
      We remove the zero_user_page macro. Issues:
      
      1. Its a macro. Inline functions are preferable.
      
      2. The KM_USER0 macro is only defined for HIGHMEM.
      
         Having to treat this special case everywhere makes the
         code needlessly complex. The parameter for zeroing is always
         KM_USER0 except in one single case that we open code.
      
      Avoiding KM_USER0 makes a lot of code not having to be dealing
      with the special casing for HIGHMEM anymore. Dealing with
      kmap is only necessary for HIGHMEM configurations. In those
      configurations we use KM_USER0 like we do for a series of other
      functions defined in highmem.h.
      
      Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
      function could not be a macro. zero_user_* functions introduced
      here can be be inline because that constant is not used when these
      functions are called.
      
      Also extract the flushing of the caches to be outside of the kmap.
      
      [akpm@linux-foundation.org: fix nfs and ntfs build]
      [akpm@linux-foundation.org: fix ntfs build some more]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: David Chinner <dgc@sgi.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eebd2aa3
  9. 20 10月, 2007 1 次提交
  10. 19 10月, 2007 1 次提交
  11. 17 10月, 2007 1 次提交
  12. 17 7月, 2007 1 次提交
  13. 24 6月, 2007 1 次提交
  14. 10 5月, 2007 1 次提交
  15. 09 5月, 2007 3 次提交
    • J
      ext3: copy i_flags to inode flags on write · 28be5abb
      Jan Kara 提交于
      A patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc.  from
      i_flags to EXT3_I(inode)->i_flags when inode is written to disk.  The same
      thing is done on GETFLAGS ioctl.
      
      Quota code changes these flags on quota files (to make it harder for
      sysadmin to screw himself) and these changes were not correctly propagated
      into the filesystem (especially, lsattr did not show them and users were
      wondering...).
      
      Propagate flags such as S_APPEND, S_IMMUTABLE, etc.  from i_flags into
      ext3-specific i_flags.  Hence, when someone sets these flags via a
      different interface than ioctl, they are stored correctly.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28be5abb
    • R
      header cleaning: don't include smp_lock.h when not used · e63340ae
      Randy Dunlap 提交于
      Remove includes of <linux/smp_lock.h> where it is not used/needed.
      Suggested by Al Viro.
      
      Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
      sparc64, and arm (all 59 defconfigs).
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e63340ae
    • M
      ext2/3/4: fix file date underflow on ext2 3 filesystems on 64 bit systems · 4d7bf11d
      Markus Rechberger 提交于
      Taken from http://bugzilla.kernel.org/show_bug.cgi?id=5079
      
      signed long ranges from -2.147.483.648 to 2.147.483.647 on x86 32bit
      
      10000011110110100100111110111101 .. -2,082,844,739
      10000011110110100100111110111101 ..  2,212,122,557 <- this currently gets
      stored on the disk but when converting it to a 64bit signed long value it loses
      its sign and becomes positive.
      
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: <linux-ext4@vger.kernel.org>
      
      Andreas says:
      
      This patch is now treating timestamps with the high bit set as negative
      times (before Jan 1, 1970).  This means we lose 1/2 of the possible range
      of timestamps (lopping off 68 years before unix timestamp overflow -
      now only 30 years away :-) to handle the extremely rare case of setting
      timestamps into the distant past.
      
      If we are only interested in fixing the underflow case, we could just
      limit the values to 0 instead of storing negative values.  At worst this
      will skew the timestamp by a few hours for timezones in the far east
      (files would still show Jan 1, 1970 in "ls -l" output).
      
      That said, it seems 32-bit systems (mine at least) allow files to be set
      into the past (01/01/1907 works fine) so it seems this patch is bringing
      the x86_64 behaviour into sync with other kernels.
      
      On the plus side, we have a patch that is ready to add nanosecond timestamps
      to ext3 and as an added bonus adds 2 high bits to the on-disk timestamp so
      this extends the maximum date to 2242.
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d7bf11d
  16. 03 4月, 2007 1 次提交
    • A
      [PATCH] revert "retries in ext3_prepare_write() violate ordering requirements" · 1aa9b4b9
      Andrew Morton 提交于
      Revert e92a4d59.
      
      Dmitry points out
      
      "When we block_prepare_write() failed while ext3_prepare_write() we jump to
       "failure" label and call ext3_prepare_failure() witch search last mapped bh
       and invoke commit_write untill it.  This is wrong!!  because some bh from
       begining to the last mapped bh may be not uptodate.  As a result we commit to
       disk not uptodate page content witch contains garbage from previous usage."
      
      and
      
      "Unexpected file size increasing."
      
         Call trace the same as it was in first issue but result is different.
         For example we have file with i_size is zero.  we want write two blocks ,
         but fs has only one free block.
      
         ->ext3_prepare_write(...from == 0, to == 2048)
           retry:
           ->block_prepare_write() == -ENOSPC# we failed but allocated one block here.
           ->ext3_prepare_failure()
             ->commit_write( from == 0, to == 1024) # after this i_size becomes 1024 :)
           if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
              goto retry;
      
         Finally when all retries will be spended ext3_prepare_failure return
         -ENOSPC, but i_size was increased and later block trimm procedures can't
         help here.
      
      We don't appear to have the horsepower to fix these issues, so let's put
      things back the way they were for now.
      
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ken Chen <kenneth.w.chen@intel.com>
      Cc: Andrey Savochkin <saw@sw.ru>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Dmitriy Monakhov <dmonakhov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1aa9b4b9
  17. 12 2月, 2007 1 次提交
  18. 08 12月, 2006 1 次提交
  19. 01 10月, 2006 1 次提交
  20. 27 9月, 2006 4 次提交
  21. 17 9月, 2006 1 次提交
    • S
      [PATCH] ext3 sequential read regression fix · 20acaa18
      Suparna Bhattacharya 提交于
      ext3-get-blocks support caused ~20% degrade in Sequential read
      performance (tiobench). Problem is with marking the buffer boundary
      so IO can be submitted right away. Here is the patch to fix it.
      
        2.6.18-rc6:
        -----------
        # ./iotest
        1048576+0 records in
        1048576+0 records out
        4294967296 bytes (4.3 GB) copied, 75.2726 seconds, 57.1 MB/s
      
        real    1m15.285s
        user    0m0.276s
        sys     0m3.884s
      
        2.6.18-rc6 + fix:
        -----------------
        [root@elm3a241 ~]# ./iotest
        1048576+0 records in
        1048576+0 records out
        4294967296 bytes (4.3 GB) copied, 62.9356 seconds, 68.2 MB/s
      
      The boundary block check in ext3_get_blocks_handle needs to be adjusted
      against the count of blocks mapped in this call, now that it can map
      more than one block.
      Signed-off-by: NSuparna Bhattacharya <suparna@in.ibm.com>
      Tested-by: NBadari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      20acaa18
  22. 09 9月, 2006 1 次提交
    • B
      [PATCH] ext3_getblk() should handle HOLE correctly · 3665d0e5
      Badari Pulavarty 提交于
      It has been reported that ext3_getblk() is not doing the right thing and
      triggering following WARN():
      
      BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
       <c01c5140> ext3_getblk+0x98/0x2a6  <c03b2806> md_wakeup_thread+0x26/0x2a
       <c01c536d> ext3_bread+0x1f/0x88  <c01cedf9> ext3_quota_read+0x136/0x1ae
       <c018b683> v1_read_dqblk+0x61/0xac  <c0188f32> dquot_acquire+0xf6/0x107
       <c01ceaba> ext3_acquire_dquot+0x46/0x68  <c01897d4> dqget+0x155/0x1e7
       <c018a97b> dquot_transfer+0x3e0/0x3e9  <c016fe52> dput+0x23/0x13e
       <c01c7986> ext3_setattr+0xc3/0x240  <c0120f66> current_fs_time+0x52/0x6a
       <c017320e> notify_change+0x2bd/0x30d  <c0159246> chown_common+0x9c/0xc5
       <c02a222c> strncpy_from_user+0x3b/0x68  <c0167fe6> do_path_lookup+0xdf/0x266
       <c016841b> __user_walk_fd+0x44/0x5a  <c01592b9> sys_chown+0x4a/0x55
       <c015a43c> vfs_write+0xe7/0x13c  <c01695d4> sys_mkdir+0x1f/0x23
       <c0102a97> syscall_call+0x7/0xb
      
      Looking at the code, it looks like it's not handle HOLE correctly.  It ends
      up returning -EIO.  Here is the patch to fix it.
      
      If we really want to be paranoid, we can allow return values 0 (HOLE), 1
      (we asked for one block) and return -EIO for more than 1 block.  But I
      really don't see a reason for doing it - all we need is the block# here.
      (doesn't matter how many blocks are mapped).
      
      ext3_get_blocks_handle() returns number of blocks it mapped.  It returns 0
      in case of HOLE.  ext3_getblk() should handle HOLE properly (currently its
      dumping warning stack and returning -EIO).
      Signed-off-by: NBadari Pulavarty <pbadari@us.ibm.com>
      Acked-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3665d0e5
  23. 01 8月, 2006 2 次提交
  24. 29 6月, 2006 1 次提交
  25. 26 6月, 2006 2 次提交
    • M
      [PATCH] ext3_fsblk_t: the rest of in-kernel filesystem blocks conversion · 43d23f90
      Mingming Cao 提交于
      Convert the ext3 in-kernel filesystem blocks to ext3_fsblk_t.  Convert the
      rest of all unsigned long type in-kernel filesystem blocks to ext3_fsblk_t,
      and replace the printk format string respondingly.
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      43d23f90
    • M
      [PATCH] ext3_fsblk_t: filesystem, group blocks and bug fixes · 1c2bf374
      Mingming Cao 提交于
      Some of the in-kernel ext3 block variable type are treated as signed 4 bytes
      int type, thus limited ext3 filesystem to 8TB (4kblock size based).  While
      trying to fix them, it seems quite confusing in the ext3 code where some
      blocks are filesystem-wide blocks, some are group relative offsets that need
      to be signed value (as -1 has special meaning).  So it seem saner to define
      two types of physical blocks: one is filesystem wide blocks, another is
      group-relative blocks.  The following patches clarify these two types of
      blocks in the ext3 code, and fix the type bugs which limit current 32 bit ext3
      filesystem limit to 8TB.
      
      With this series of patches and the percpu counter data type changes in the mm
      tree, we are able to extend exts filesystem limit to 16TB.
      
      This work is also a pre-request for the recent >32 bit ext3 work, and makes
      the kernel to able to address 48 bit ext3 block a lot easier: Simply redefine
      ext3_fsblk_t from unsigned long to sector_t and redefine the format string for
      ext3 filesystem block corresponding.
      
      Two RFC with a series patches have been posted to ext2-devel list and have
      been reviewed and discussed:
      http://marc.theaimsgroup.com/?l=ext2-devel&m=114722190816690&w=2
      
      http://marc.theaimsgroup.com/?l=ext2-devel&m=114784919525942&w=2
      
      Patches are tested on both 32 bit machine and 64 bit machine, <8TB ext3 and
      >8TB ext3 filesystem(with the latest to be released e2fsprogs-1.39).  Tests
      includes overnight fsx, tiobench, dbench and fsstress.
      
      This patch:
      
      Defines ext3_fsblk_t and ext3_grpblk_t, and the printk format string for
      filesystem wide blocks.
      
      This patch classifies all block group relative blocks, and ext3_fsblk_t blocks
      occurs in the same function where used to be confusing before.  Also include
      kernel bug fixes for filesystem wide in-kernel block variables.  There are
      some fileystem wide blocks are treated as int/unsigned int type in the kernel
      currently, especially in ext3 block allocation and reservation code.  This
      patch fixed those bugs by converting those variables to ext3_fsblk_t(unsigned
      long) type.
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1c2bf374
  26. 04 5月, 2006 1 次提交
  27. 27 3月, 2006 3 次提交