1. 05 Jan, 2009 (1 commit)
    • fs: symlink write_begin allocation context fix · 54566b2c
      Nick Piggin authored
      With the write_begin/write_end aops, page_symlink was broken because it
      could no longer pass a GFP_NOFS type mask into the point where the
      allocations happened.  They are done in write_begin, which would always
      assume that the filesystem can be entered from reclaim.  This bug could
      cause filesystem deadlocks.
      
      The funny thing with having a gfp_t mask there is that it doesn't really
      allow the caller to arbitrarily tinker with the context in which it can be
      called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
      take the page lock.  The only thing any callers care about is __GFP_FS
      anyway, so turn that into a single flag.
      
      Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
      this flag in their write_begin function.  Change __grab_cache_page to
      accept a nofs argument as well, to honour that flag (while we're there,
      change the name to grab_cache_page_write_begin which is more instructive
      and does away with random leading underscores).
      
      This is really a more flexible way to go in the end anyway -- if a
      filesystem happens to want any extra allocations aside from the pagecache
      ones in its write_begin function, it may now use GFP_KERNEL (rather than
      GFP_NOFS) for common-case allocations (e.g. ocfs2_alloc_write_ctxt, for a
      random example).
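
      As a minimal sketch of how a filesystem might honour the new flag
      (AOP_FLAG_NOFS and grab_cache_page_write_begin() are the interfaces this
      commit introduces; the surrounding write_begin function is purely
      illustrative):

      static int example_write_begin(struct file *file,
      				struct address_space *mapping,
      				loff_t pos, unsigned len, unsigned flags,
      				struct page **pagep, void **fsdata)
      {
      	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
      	struct page *page;

      	/* the pagecache allocation itself now respects AOP_FLAG_NOFS */
      	page = grab_cache_page_write_begin(mapping, index, flags);
      	if (!page)
      		return -ENOMEM;
      	*pagep = page;

      	/* any private allocations must honour the flag as well, e.g.
      	 * kmalloc(size, (flags & AOP_FLAG_NOFS) ? GFP_NOFS : GFP_KERNEL) */

      	return 0;
      }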
      
      [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
      [kosaki.motohiro@jp.fujitsu.com: fix fuse]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ Cleaned up the calling convention: just pass in the AOP flags
        untouched to the grab_cache_page_write_begin() function.  That
        just simplifies everybody, and may even allow future expansion of the
        logic.   - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 01 Jan, 2009 (1 commit)
  3. 07 Nov, 2008 (2 commits)
  4. 17 Oct, 2008 (1 commit)
  5. 16 Oct, 2008 (1 commit)
  6. 14 Oct, 2008 (1 commit)
  7. 11 Oct, 2008 (1 commit)
  8. 07 Oct, 2008 (1 commit)
  9. 10 Oct, 2008 (2 commits)
  10. 14 Sep, 2008 (2 commits)
  11. 09 Sep, 2008 (2 commits)
  12. 09 Oct, 2008 (1 commit)
  13. 10 Oct, 2008 (1 commit)
  14. 09 Sep, 2008 (1 commit)
  15. 09 Oct, 2008 (1 commit)
    • ext4: Make sure all the block allocation paths reserve blocks · a30d542a
      Aneesh Kumar K.V authored
      With delayed allocation we need to make sure blocks are reserved before
      we attempt to allocate them.  Otherwise we get block allocation failures
      (ENOSPC) during writepages, which cannot be handled.  This would mean
      silent data loss (we do a printk stating that data will be lost).  This
      patch updates the DIO and fallocate code paths to do block reservation
      before block allocation.  This is needed to make sure parallel DIO and
      fallocate requests don't take blocks out of the delayed reserve space.
      
      When the free block count goes below a threshold we switch to a slow
      path which looks at the other CPUs' accumulated percpu counter values.
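
      A hedged sketch of that idea (the names and exact policy here are
      illustrative assumptions; only the fast-read/slow-sum split is the point):

      #include <linux/percpu_counter.h>

      /* return non-zero if nblocks may be claimed from the free-block counter */
      static int claim_free_blocks(struct percpu_counter *free_blocks,
      			     s64 nblocks, s64 threshold)
      {
      	s64 free = percpu_counter_read_positive(free_blocks);

      	if (free - nblocks < threshold)
      		/* slow path: exact sum across all CPUs' local deltas */
      		free = percpu_counter_sum_positive(free_blocks);

      	return free >= nblocks;
      }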
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  16. 20 Aug, 2008 (1 commit)
  17. 09 Sep, 2008 (1 commit)
  18. 03 Aug, 2008 (1 commit)
  19. 29 Jul, 2008 (1 commit)
    • vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a
      Hisashi Hifumi authored
      When we read part of a file through the pagecache, and the page at the
      corresponding index exists but is not uptodate, read IO is issued and
      the page is brought uptodate.
      
      I think this is fine when pagesize == blocksize, but there is room for
      improvement when pagesize != blocksize, because in that case a page can
      have multiple buffers, and even if the page is not uptodate, some of its
      buffers can be.
      
      So I suggest that when all the buffers covering the part of the file we
      want to read are uptodate, we use this pagecache page and copy data from
      it to the user buffer even if the page itself is not uptodate.  This can
      reduce read IO and improve system throughput.
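
      A simplified sketch of the per-buffer check (the helper this change adds
      is block_is_partially_uptodate(), wired up as the ->is_partially_uptodate
      address_space operation; the code below only illustrates the walk and is
      not a verbatim copy):

      /* return 1 if every buffer overlapping [from, from + count) is uptodate */
      static int buffers_cover_read(struct page *page, unsigned from, unsigned count)
      {
      	struct buffer_head *head = page_buffers(page), *bh = head;
      	unsigned block_start = 0, to = from + count;
      	int ret = 1;

      	do {
      		unsigned block_end = block_start + bh->b_size;

      		if (block_end > from && block_start < to && !buffer_uptodate(bh))
      			ret = 0;
      		block_start = block_end;
      		bh = bh->b_this_page;
      	} while (ret && bh != head);

      	return ret;
      }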
      
      I wrote a benchmark program and measured the following results with it.
      
      The benchmark does:
      
        1: mount and open a test file.
      
        2: create a 512MB file.
      
        3: close the file and unmount.
      
        4: mount and open the test file again.
      
        5: pwrite randomly 300000 times on the test file.  Offsets are aligned
           to the IO size (1024 bytes).
      
        6: measure the time of 100000 random preads on the test file.
      
      The results were:
      
      	2.6.26:          330 sec
      	2.6.26-patched:  226 sec
      
      Arch: i386
      Filesystem: ext3
      Blocksize: 1024 bytes
      Memory: 1GB
      
      On ext3/4, a file is written through buffers/blocks, so mixed random
      read/write workloads, or random reads after random writes, are optimized
      by this patch when pagesize != blocksize.  The test results above show
      this.
      
      The benchmark program is as follows:
      
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <time.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mount.h>
      
      #define LEN 1024
      #define LOOP 1024*512 /* 512MB */
      
      int main(void)
      {
      	unsigned long i, offset, filesize;
      	int fd;
      	char buf[LEN];
      	time_t t1, t2;
      
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	memset(buf, 0, LEN);
      	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      	for (i = 0; i < LOOP; i++)
      		write(fd, buf, LEN);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	fd = open("/root/test1/testfile", O_RDWR);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      
      	filesize = LEN * LOOP;
      	for (i = 0; i < 300000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pwrite(fd, buf, LEN, offset);
      	}
      	printf("start test\n");
      	time(&t1);
      	for (i = 0; i < 100000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pread(fd, buf, LEN, offset);
      	}
      	time(&t2);
      	printf("%ld sec\n", t2-t1);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      
      	return 0;
      }
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 19 Aug, 2008 (1 commit)
  21. 20 Aug, 2008 (5 commits)
    • ext4: journal credit fix for the delayed allocation's writepages() function · 525f4ed8
      Mingming Cao authored
      The previous delalloc writepages implementation started a new transaction
      outside of the loop that called get_block() to do the block allocation.
      Since we didn't know exactly how many blocks would need to be allocated,
      the estimate of the journal credits required was very conservative and
      caused many issues.
      
      With the reworked delayed allocation, a new transaction is created for
      each get_block(), so we no longer need to guess how many credits the
      multiple chunks of allocation will need.  We start every transaction with
      enough credits for inserting a single extent.  When estimating the
      credits for the indirect blocks needed to allocate a chunk of blocks, we
      need to know the number of data blocks to allocate.  We use the total
      number of reserved delalloc data blocks; if that is too big, for
      non-extent files we limit the number of blocks to EXT4_MAX_TRANS_BLOCKS.
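
      A hedged sketch of the sizing step (the helper name and extents check are
      illustrative assumptions; only EXT4_MAX_TRANS_BLOCKS comes from the
      description above):

      /* size one writepages chunk from the reserved delalloc data blocks */
      static int da_chunk_blocks(struct inode *inode, int reserved_data_blocks)
      {
      	int nrblocks = reserved_data_blocks;

      	/* indirect-mapped (non-extent) files are capped per transaction */
      	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) &&
      	    nrblocks > EXT4_MAX_TRANS_BLOCKS)
      		nrblocks = EXT4_MAX_TRANS_BLOCKS;

      	return nrblocks;
      }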
      
      Code cleanup from Aneesh.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Rework the ext4_da_writepages() function · a1d6cc56
      Aneesh Kumar K.V authored
      With the changes below, we reserve the credits needed to insert only one
      extent resulting from a single call to get_block.  This makes sure we
      don't take too many journal credits during writeout.  We also don't limit
      the pages to write; instead we loop through the dirty pages building the
      largest possible contiguous block request and then issue a single
      get_block request.  We may get fewer blocks than we requested.  If so, we
      end up not mapping some of the buffer_heads, which means those
      buffer_heads are still marked delayed.  Later, in the writepage callback
      via __mpage_writepage, we redirty those pages.
      
      We should also not limit/throttle wbc->nr_to_write in the filesystem
      writepages callback.  Doing so causes wrong behaviour in
      generic_sync_sb_inodes when wbc->nr_to_write ends up <= 0.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: journal credits reservation fixes for DIO, fallocate · f3bd1f3f
      Mingming Cao authored
      DIO and fallocate credit calculation is different from writepage, as
      they start a new journal handle for each call to ext4_get_blocks_wrap().
      This patch uses the helper function in the DIO and fallocate cases,
      passing a flag indicating that the modified data are contiguous, so
      fewer indirect/index blocks need to be accounted for.
      
      This patch also fixes the journal credit reservation for direct I/O
      (DIO).  Previously the estimated credits for DIO were only calculated
      for non-extent files, which was not enough if the file is extent-based.
      
      Also fixed is fallocate double-counting the credits for modifying the
      superblock.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: journal credits calculation cleanup and fix for non-extent writepage · a02908f1
      Mingming Cao authored
      When considering how many journal credits are needed for modifying a
      chunk of data, we need to account for the superblock, inode block, quota
      blocks and xattr block, the indirect/index blocks, and also the group
      bitmap and group descriptor blocks for new allocations (covering both
      data and indirect/index blocks).  There are many places in ext4 that do
      this calculation on their own; they often miss one or two metadata
      blocks, assume single-block allocation, and do not consider the case of
      allocating multiple chunks.
      
      This patch cleans up the current journal credit code and provides common
      helper functions to calculate the journal credits, to be used for
      writepage, writepages, DIO, fallocate, migration and defrag, and for
      both non-extent and extent files.
      
      It modifies the writepage/write_begin credit calculation for non-extent
      files to use the new helper function.  It also fixes the problem that
      writepage on non-extent files did not consider the blocksize < pagesize
      case, which can require multiple block allocations in a single
      transaction.
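
      A hedged sketch of what such a per-chunk estimate has to cover (the real
      helpers live in fs/ext4/inode.c; the function below only illustrates the
      accounting the commit describes and is not their code):

      /* rough journal-credit estimate for modifying one chunk of data */
      static int chunk_trans_credits(int data_blocks, int index_blocks,
      			       int groups_touched, int quota_blocks)
      {
      	return data_blocks          /* the data blocks themselves        */
      	     + index_blocks         /* indirect or extent index blocks   */
      	     + groups_touched * 2   /* group bitmap + group descriptor   */
      	     + 1                    /* superblock                        */
      	     + 1                    /* inode block                       */
      	     + 1                    /* xattr block                       */
      	     + quota_blocks;        /* quota updates, if enabled         */
      }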
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Fix delalloc release block reservation for truncate · cd213226
      Mingming Cao authored
      Ext4 releases the blocks reserved for delayed allocation when an inode
      is truncated or unlinked.  If there are no reserved blocks at all, we
      shouldn't need to do so, but the current code still tries to release the
      reserved blocks regardless of whether the counter's value is 0.
      Continuing to do that makes the later calculation go wrong, and a kernel
      BUG_ON() caught that.  This doesn't happen for extent-based files, as
      the calculation for 0 reserved blocks was already right for them.
      
      This patch fixes the kernel BUG() described above.  It adds checks for 0
      to avoid the unnecessary release and fixes the calculation for
      non-extent files.
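
      A hedged sketch of the check (field and lock names follow the ext4
      delalloc bookkeeping of that era and are stated as assumptions, not
      quoted code):

      /* skip the release entirely when nothing was reserved for this inode */
      static void release_delalloc_reservation(struct inode *inode)
      {
      	spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
      	if (!EXT4_I(inode)->i_reserved_data_blocks) {
      		spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
      		return;
      	}
      	/* ... otherwise adjust the reserved data/meta block counters ... */
      	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
      }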
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  22. 14 Aug, 2008 (1 commit)
  23. 20 Aug, 2008 (1 commit)
  24. 27 Jul, 2008 (1 commit)
  25. 18 Jul, 2008 (1 commit)
  26. 03 Aug, 2008 (2 commits)
  27. 27 Jul, 2008 (1 commit)
    • ext4: don't read inode block if the buffer has a write error · 9c83a923
      Hidehiro Kawai authored
      A transient I/O error can corrupt inode data.  Here is the scenario:
      
      (1) update inode_A at block_B
      (2) pdflush writes the new inode_A out to the filesystem, but this
          results in a write I/O error; at this point, the BH_Uptodate flag of
          the buffer for block_B is cleared and BH_Write_EIO is set
      (3) create a new inode_C which is located at block_B, and
          __ext4_get_inode_loc() tries to read on-disk block_B because the
          buffer is not uptodate
      (4) if it can read on-disk block_B successfully, inode_A is
          overwritten by old data
      
      This patch makes __ext4_get_inode_loc() not read the inode block if the
      buffer has the BH_Write_EIO flag set.  In this case the buffer should
      hold the latest information, so we set the uptodate flag on the buffer
      (this also avoids the WARN_ON_ONCE() in mark_buffer_dirty()).
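
      A sketch of the flag check (the surrounding __ext4_get_inode_loc() logic
      is elided; only the buffer-flag handling described above is shown):

      lock_buffer(bh);
      if (buffer_write_io_error(bh)) {
      	/*
      	 * A previous write failed, so the in-memory copy is newer than
      	 * whatever is on disk: mark the buffer uptodate instead of
      	 * re-reading (and clobbering it with) the stale on-disk block.
      	 */
      	set_buffer_uptodate(bh);
      }
      unlock_buffer(bh);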
      
      With this change, we will need to test the BH_Write_EIO flag when doing
      error checking.  Currently nobody checks write I/O errors on metadata
      buffers, but that will be done in other patches I'm working on.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: sugita <yumiko.sugita.yf@hitachi.com>
      Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  28. 12 Jul, 2008 (4 commits)
    • ext4: delayed allocation i_blocks fix for stat · 3e3398a0
      Mingming Cao authored
      Right now i_blocks is not updated until the blocks are actually
      allocated on disk.  This means that with delayed allocation, right after
      files are copied, "ls -sF" shows the files as taking 0 blocks on disk.
      "du" also shows the files taking zero space, which is highly confusing
      to the user.
      
      Since delayed allocation already keeps track of the per-inode total
      number of blocks that are subject to delayed allocation, this patch
      fixes this by using that count to adjust the value returned by stat(2).
      When the real block allocation is done, i_blocks gets updated; since the
      blocks reserved for delayed allocation are decreased at the same time,
      the value returned by stat(2) stays consistent.
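
      A hedged sketch of the stat(2) adjustment (the reservation field name is
      an assumption drawn from the delalloc counters of that era, not quoted
      code):

      /* report delalloc-reserved blocks in st_blocks (512-byte units) */
      static int getattr_with_delalloc(struct inode *inode, struct kstat *stat)
      {
      	unsigned long delalloc_blocks;

      	generic_fillattr(inode, stat);

      	delalloc_blocks = EXT4_I(inode)->i_reserved_data_blocks;
      	stat->blocks += delalloc_blocks << (inode->i_sb->s_blocksize_bits - 9);
      	return 0;
      }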
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix delalloc i_disksize early update issue · 632eaeab
      Mingming Cao authored
      ext4_da_write_end() used walk_page_buffers() with a callback function of
      ext4_bh_unmapped_or_delay() to check whether it extended the file size
      without allocating any blocks (since in this case i_disksize needs to be
      updated).  However, this didn't work properly because the buffer head
      has not been marked dirty yet --- that is done later in
      block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to
      always return false.
      
      In addition, walk_page_buffers() checks all of the buffer heads covering
      the page, and the only buffer_head that should be checked is the one
      covering the end of the write.  Otherwise, given a 1k blocksize
      filesystem and a 4k page size, the buffer head covering the first 1k
      stripe of the file could be unmapped (because it was a sparse file), and
      the second or third buffer_head covering that page could be mapped, and
      using walk_page_buffers() would fail in this case since it would stop at
      the first unmapped buffer_head and return true.
      
      The core problem is that walk_page_buffers() was intended to do work in
      a callback function, with a non-zero return value indicating a failure
      that terminated the walk of the buffer heads covering the page.  It was
      not intended to be used with a boolean function such as
      ext4_bh_unmapped_or_delay().
      
      An additional fix from Aneesh protects the i_disksize update against a
      race with truncate.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Handle page without buffers in ext4_*_writepage() · f0e6c985
      Aneesh Kumar K.V authored
      It can happen that buffers are removed from the page before it gets
      marked dirty and then is passed to writepage().  In writepage() we just
      initialize the buffers and check whether they are mapped and not
      delayed.  If they are mapped and not delayed we write the page;
      otherwise we mark them dirty.  With this change we don't do block
      allocation at all in ext4_*_writepage().
      
      writepage() can get called under many conditions, and with a locking
      order of journal_start -> lock_page we should not try to allocate blocks
      in writepage(), which gets called after taking the page lock.
      writepage() can get called via shrink_page_list even with a journal
      handle that was created for doing an inode update.  For example, when
      doing ext4_da_write_begin we create a journal handle with 1 credit,
      expecting an i_disksize update for the inode.  But ext4_da_write_begin
      can cause shrink_page_list via _grab_page_cache.  So having a valid
      handle via ext4_journal_current_handle is not a guarantee that we can
      use the handle for block allocation in writepage, since we shouldn't be
      using credits that had been reserved for other updates; doing so could
      result in running out of credits when we update inodes.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Add ordered mode support for delalloc · cd1aac32
      Aneesh Kumar K.V authored
      This provides a new ordered mode implementation which gets rid of using
      buffer heads to enforce the ordering between a metadata change and the
      related data change.  Instead, the new ordered mode keeps track of all
      the inodes touched by each transaction on a list, and when that
      transaction is committed, it flushes all of the dirty pages for those
      inodes.  In addition, the new ordered mode reverses the lock ordering of
      the page lock and the transaction lock, which provides easier support
      for delayed allocation.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>