1. 23 Dec, 2009 1 commit
  2. 10 Dec, 2009 1 commit
    • ext3: Fix data / filesystem corruption when write fails to copy data · 68eb3db0
      Committed by Jan Kara
      When ext3_write_begin fails after allocating some blocks or
      generic_perform_write fails to copy data to write, we truncate blocks already
      instantiated beyond i_size. Although these blocks were never inside i_size, we
      have to truncate pagecache of these blocks so that corresponding buffers get
      unmapped. Otherwise subsequent __block_prepare_write (called because we are
      retrying the write) will find the buffers mapped, not call ->get_block, and
      thus the page will be backed by already freed blocks leading to filesystem and
      data corruption.
      Reported-by: James Y Knight <foom@fuhm.net>
      Signed-off-by: Jan Kara <jack@suse.cz>
  3. 04 Dec, 2009 1 commit
  4. 11 Nov, 2009 2 commits
  5. 16 Sep, 2009 3 commits
    • ext3: Add locking to ext3_do_update_inode · 4f003fd3
      Committed by Chris Mason
      I've been struggling with this off and on while I've been testing the
      data=guarded work.  The symptom is corrupted orphan lists and inodes
      with the wrong i_size stored on disk.  I was convinced the
      data=guarded code was just missing a call to ext3_mark_inode_dirty, but
      tracing showed the i_disksize I was sending to ext3_mark_inode_dirty
      wasn't actually making it to the drive.
      
      ext3_mark_inode_dirty can be called without locks held (atime updates
      and a few others), so the data=guarded code uses locks while updating
      the in-memory inode, and then calls ext3_mark_inode_dirty
      without any locks held.
      
      But, ext3_mark_inode_dirty has no internal locking to make sure that
      only one CPU is updating the buffer head at a time.  Generally this
      works out ok because everyone that changes the inode then calls
      ext3_mark_inode_dirty themselves.  Even though it races, eventually
      someone updates the buffer heads and things move on.
      
      But there is still a risk of the wrong values getting in, and the
      data=guarded code seems to hit the race very often.
      
      Since everyone that changes the inode also logs it, it should be
      possible to fix this with some memory barriers.  I'll leave that as an
      exercise to the reader and lock the buffer head instead.
      
      It is probably a good idea to have a different patch series for lockless
      bit flipping on the ext3 i_state field.  ext3_do_update_inode uses &= to
      clear EXT3_STATE_NEW without any locks held.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
    • ext3: Fix possible deadlock between ext3_truncate() and ext3_get_blocks() · 00171d3c
      Committed by Jan Kara
      During truncate we are sometimes forced to start a new transaction as the
      amount of blocks to be journaled is both quite large and hard to predict. So
      far we restarted a transaction while holding truncate_mutex and that violates
      lock ordering because truncate_mutex ranks below transaction start (and it
      can lead to a real deadlock with ext3_get_blocks() allocating new blocks
      from ext3_writepage()).
      
      Luckily, the problem is easy to fix: We just drop the truncate_mutex before
      restarting the transaction and acquire it afterwards. We are safe to do this as
      by the time ext3_truncate() is called, all the page cache for the truncated
      part of the file is dropped and so writepage() cannot come and allocate new
      blocks in the part of the file we are truncating. The remaining writers are
      stopped by us holding i_mutex.
      Signed-off-by: Jan Kara <jack@suse.cz>
    • HWPOISON: Enable .remove_error_page for migration aware file systems · aa261f54
      Committed by Andi Kleen
      Enable removing of corrupted pages through truncation
      for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs.
      These should cover most server needs.
      
      I chose the set of migration aware file systems for this
      for now, assuming they have been especially audited.
      But in general it should be safe for all file systems
      on the data area that support read/write and truncate.
      
      Caveat: the hardware error handler does not take i_mutex
      for now before calling the truncate function. Is that ok?
      
      Cc: tytso@mit.edu
      Cc: hch@infradead.org
      Cc: mfasheh@suse.com
      Cc: aia21@cantab.net
      Cc: hugh.dickins@tiscali.co.uk
      Cc: swhiteho@redhat.com
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
  6. 16 Jul, 2009 2 commits
    • ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() · 43237b54
      Committed by Jan Kara
      Get rid of the extenddisksize parameter of ext3_get_blocks_handle(). This seems
      to be a relic from some old days, and setting disksize in this function does
      not make much sense. Currently it is set only by ext3_getblk().  Since the
      parameter has some effect only if create == 1, it is easy to check that the
      three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
      ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.
      Signed-off-by: Jan Kara <jack@suse.cz>
    • ext3: Fix truncation of symlinks after failed write · 9eaaa2d5
      Committed by Jan Kara
      The contents of long symlinks are written via standard write methods. So when the
      write fails, we add inode to orphan list. But symlinks don't have .truncate
      method defined so nobody properly removes them from the orphan list (both on
      disk and in memory).
      
      Fix this by calling ext3_truncate() directly instead of calling vmtruncate()
      (which is saner anyway since we don't need anything vmtruncate() does apart
      from calling .truncate in these paths).  We also add inode to orphan list only
      if ext3_can_truncate() is true (currently, it can be false for symlinks when
      there are no blocks allocated) - otherwise orphan list processing will complain
      and ext3_truncate() will not remove inode from on-disk orphan list.
      Signed-off-by: Jan Kara <jack@suse.cz>
  7. 24 Jun, 2009 1 commit
  8. 19 Jun, 2009 2 commits
  9. 12 Jun, 2009 1 commit
  10. 09 Apr, 2009 1 commit
  11. 03 Apr, 2009 2 commits
    • ext3: Add replace-on-truncate heuristics for data=writeback mode · f7ab34ea
      Committed by Theodore Ts'o
      In data=writeback mode, start an asynchronous flush when closing a
      file which had been previously truncated down to zero.  This lowers
      the probability of data loss in the case of applications that attempt
      to replace a file using truncate.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
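The heuristic above only lowers the probability of data loss; it does not make replace-by-truncate safe. As a minimal userspace sketch (not part of the commit), an application that replaces a file this way can flush the new contents itself before closing. The helper name `replace_by_truncate` is hypothetical, and the sketch does not retry on EINTR.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the contents of `path` by truncating it
 * and writing `len` bytes from `data`, then fsync() before close() so the
 * new data does not depend on any kernel-side flush heuristic.
 * Returns 0 on success, -1 on error.
 */
static int replace_by_truncate(const char *path, const void *data, size_t len)
{
	const char *p = data;
	int fd;

	assert(path != NULL && (data != NULL || len == 0));

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* write() may be partial, so loop until everything is written. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Explicit flush: wait until the data has reached stable storage. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A call such as `replace_by_truncate("/etc/myapp.conf", buf, n)` still leaves a window in which the file is empty; writing a temporary file and renaming it over the original avoids that window.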
    • ext3: avoid false EIO errors · 695f6ae0
      Committed by Jan Kara
      Sometimes block_write_begin() can map buffers in a page but later we
      fail to copy data into those buffers (because the source page has been
      paged out in the meantime).  We then end up with !uptodate mapped
      buffers.  To add a bit more to the confusion, block_write_end() does
      not commit any data (and thus does not mark any buffers as uptodate) if
      we didn't succeed with copying all the data.
      
      Commit f4fc66a8 (ext3: convert to new
      aops) missed these cases and thus we were inserting non-uptodate
      buffers to transaction's list which confuses JBD code and it reports IO
      errors, aborts a transaction and generally makes users afraid about
      their data ;-P.
      
      This patch fixes the problem by reorganizing ext3_..._write_end() code
      to first call block_write_end() to mark buffers with valid data
      uptodate and after that we file only uptodate buffers to transaction's
      lists.
      
      We also fix a problem where we could leave blocks allocated beyond i_size
      (i_disksize in fact) because of failed write. We now add inode to orphan
      list when write fails (to be safe in case we crash) and then truncate blocks
      beyond i_size in a separate transaction.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 27 Mar, 2009 1 commit
  13. 26 Mar, 2009 1 commit
  14. 05 Jan, 2009 1 commit
    • fs: symlink write_begin allocation context fix · 54566b2c
      Committed by Nick Piggin
      With the write_begin/write_end aops, page_symlink was broken because it
      could no longer pass a GFP_NOFS type mask into the point where the
      allocations happened.  They are done in write_begin, which would always
      assume that the filesystem can be entered from reclaim.  This bug could
      cause filesystem deadlocks.
      
      The funny thing with having a gfp_t mask there is that it doesn't really
      allow the caller to arbitrarily tinker with the context in which it can be
      called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
      take the page lock.  The only thing any callers care about is __GFP_FS
      anyway, so turn that into a single flag.
      
      Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
      this flag in their write_begin function.  Change __grab_cache_page to
      accept a nofs argument as well, to honour that flag (while we're there,
      change the name to grab_cache_page_write_begin which is more instructive
      and does away with random leading underscores).
      
      This is really a more flexible way to go in the end anyway -- if a
      filesystem happens to want any extra allocations aside from the pagecache
      ones in its write_begin function, it may now use GFP_KERNEL (rather than
      GFP_NOFS) for common case allocations (eg.  ocfs2_alloc_write_ctxt, for a
      random example).
      
      [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
      [kosaki.motohiro@jp.fujitsu.com: fix fuse]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ Cleaned up the calling convention: just pass in the AOP flags
        untouched to the grab_cache_page_write_begin() function.  That
        just simplifies everybody, and may even allow future expansion of the
        logic.   - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 01 Jan, 2009 1 commit
  16. 20 Oct, 2008 1 commit
  17. 04 Oct, 2008 1 commit
    • generic block based fiemap implementation · 68c9d702
      Committed by Josef Bacik
      Any block based fs (this patch includes ext3) just has to declare its own
      fiemap() function and then call this generic function with its own
      get_block_t. This works well for block based filesystems that map
      multiple contiguous blocks at one time, but it also works for filesystems
      that only map one block at a time; you will just end up with an "extent" for
      each block. One gotcha is that this will not play nicely where there is
      hole+data after the EOF. This function will assume it has hit the end of the
      data as soon as it hits a hole after the EOF, so if there is any data past
      that it will not pick that up. AFAIK no block based fs does this anyway, but
      it is in the comments of the function just in case.
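The extent-building behaviour described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: the per-block mapping is supplied as a plain array instead of a get_block_t callback, and every name here (`model_block_fiemap`, `struct model_extent`, `MODEL_HOLE`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HOLE (-1L)	/* marks a logical block with no mapping */

/* One reported extent: `length` blocks starting at `logical`. */
struct model_extent {
	size_t logical;
	long physical;
	size_t length;
};

/*
 * Walk the per-block mapping `map` (physical block number or MODEL_HOLE)
 * and merge physically contiguous blocks into extents.  `eof_block` is the
 * first logical block past EOF.  A hole before EOF is skipped; a hole at
 * or after EOF is taken as the end of the data, so anything mapped beyond
 * it is not reported (the gotcha mentioned in the commit message).
 * Returns the number of extents written to `out`.
 */
static size_t model_block_fiemap(const long *map, size_t nr_blocks,
				 size_t eof_block,
				 struct model_extent *out, size_t max_out)
{
	size_t n = 0, i;

	assert(map != NULL && out != NULL);

	for (i = 0; i < nr_blocks; i++) {
		if (map[i] == MODEL_HOLE) {
			if (i >= eof_block)
				break;	/* hole past EOF: assume end of data */
			continue;
		}
		/* Extend the previous extent if this block continues it. */
		if (n > 0 &&
		    out[n - 1].logical + out[n - 1].length == i &&
		    out[n - 1].physical + (long)out[n - 1].length == map[i]) {
			out[n - 1].length++;
			continue;
		}
		if (n == max_out)
			break;	/* caller's extent array is full */
		out[n].logical = i;
		out[n].physical = map[i];
		out[n].length = 1;
		n++;
	}
	return n;
}
```

A filesystem that maps only one block at a time still ends up with merged extents in this model whenever the blocks happen to be physically contiguous; otherwise each block becomes its own extent.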
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: linux-fsdevel@vger.kernel.org
  18. 29 Jul, 2008 1 commit
    • vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a
      Committed by Hisashi Hifumi
      When we read some part of a file through pagecache, if there is a
      pagecache page at the corresponding index but this page is not uptodate,
      read IO is issued and the page becomes uptodate.
      
      I think this is good for pagesize == blocksize environment but there is
      room for improvement on pagesize != blocksize environment.  Because in
      this case a page can have multiple buffers and even if a page is not
      uptodate, some buffers can be uptodate.
      
      So I suggest that when all buffers which correspond to a part of a file
      that we want to read are uptodate, use this pagecache and copy data from
      this pagecache to user buffer even if a page is not uptodate.  This can
      reduce read IO and improve system throughput.
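The check suggested above can be sketched as a userspace model. It is not the kernel implementation; `struct model_page` and `model_range_uptodate` are hypothetical names, and per-block state is reduced to a boolean array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page split into fixed-size blocks. */
struct model_page {
	size_t block_size;	/* bytes per block */
	size_t nr_blocks;	/* blocks per page */
	const bool *uptodate;	/* uptodate[i]: block i holds valid data */
};

/*
 * Return true if every block overlapping the byte range
 * [offset, offset + len) is uptodate, so a read of that range could be
 * served from the page even though the page as a whole is not uptodate.
 */
static bool model_range_uptodate(const struct model_page *pg,
				 size_t offset, size_t len)
{
	size_t first, last, i;

	assert(pg != NULL && pg->block_size > 0);

	if (len == 0)
		return true;

	first = offset / pg->block_size;
	last = (offset + len - 1) / pg->block_size;
	if (last >= pg->nr_blocks)
		return false;	/* range runs past the end of the page */

	for (i = first; i <= last; i++)
		if (!pg->uptodate[i])
			return false;
	return true;
}
```

With 1024-byte blocks in a 4096-byte page, a 1024-byte aligned read touches exactly one block, which is the case the benchmark in this commit message exercises.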
      
      I wrote a benchmark program and measured the results below.
      
      The benchmark does the following:
      
        1: mount and open a test file.
      
        2: create a 512MB file.
      
        3: close a file and umount.
      
        4: mount and again open a test file.
      
        5: pwrite randomly 300000 times on a test file.  offset is aligned
           by IO size(1024bytes).
      
        6: measure time of preading randomly 100000 times on a test file.
      
      The result was:
      
      	2.6.26:         330 sec
      	2.6.26-patched: 226 sec
      
      Arch:i386
      Filesystem:ext3
      Blocksize:1024 bytes
      Memory: 1GB
      
      On ext3/4, a file is written through buffer/block.  So random read/write
      mixed workloads or random read after random write workloads are optimized
      with this patch under pagesize != blocksize environment.  This test result
      showed this.
      
      The benchmark program is as follows:
      
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <time.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mount.h>
      
      #define LEN 1024
      #define LOOP (1024 * 512) /* 512MB when multiplied by LEN */
      
      int main(void)
      {
      	unsigned long i, offset, filesize;
      	int fd;
      	char buf[LEN];
      	time_t t1, t2;
      
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount");
      		exit(1);
      	}
      	memset(buf, 0, LEN);
      	/* O_CREAT requires an explicit mode argument. */
      	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC, 0644);
      	if (fd < 0) {
      		perror("cannot open file");
      		exit(1);
      	}
      	/* Create the 512MB test file. */
      	for (i = 0; i < LOOP; i++)
      		write(fd, buf, LEN);
      	close(fd);
      	/* Remount to drop the page cache before the measured phase. */
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount");
      		exit(1);
      	}
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount");
      		exit(1);
      	}
      	fd = open("/root/test1/testfile", O_RDWR);
      	if (fd < 0) {
      		perror("cannot open file");
      		exit(1);
      	}
      
      	filesize = (unsigned long)LEN * LOOP;
      	/* Random writes, with offsets aligned to LEN. */
      	for (i = 0; i < 300000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pwrite(fd, buf, LEN, offset);
      	}
      	printf("start test\n");
      	time(&t1);
      	/* Measured phase: random reads, with offsets aligned to LEN. */
      	for (i = 0; i < 100000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pread(fd, buf, LEN, offset);
      	}
      	time(&t2);
      	printf("%ld sec\n", (long)(t2 - t1));
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount");
      		exit(1);
      	}
      	return 0;
      }
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 26 Jul, 2008 3 commits
    • ext3: handle deleting corrupted indirect blocks · 3ccc3167
      Committed by Duane Griffin
      While freeing indirect blocks we attach a journal head to the parent
      buffer head, free the blocks, then journal the parent.  If the indirect
      block list is corrupted and points to the parent the journal head will be
      detached when the block is cleared, causing an OOPS.
      
      Check for that explicitly and handle it gracefully.
      
      This patch fixes the third case (image hdb.20000057.nullderef.gz)
      reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.
      
      Immediately above the change, in the ext3_free_data function, we call
      ext3_clear_blocks to clear the indirect blocks in this parent block.  If
      one of those blocks happens to actually be the parent block it will clear
      b_private / BH_JBD.
      
      I did the check at the end rather than earlier as it seemed more elegant.
      I don't think there should be much practical difference, although it is
      possible the FS may not be quite so badly corrupted if we did it the other
      way (and didn't clear the block at all).  To be honest, I'm not convinced
      there aren't other similar failure modes lurking in this code, although I
      couldn't find any with a quick review.
      
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: Duane Griffin <duaneg@dghda.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext3: don't read inode block if the buffer has a write error · 95450f5a
      Committed by Hidehiro Kawai
      A transient I/O error can corrupt inode data.  Here is the scenario:
      
      (1) update inode_A at the block_B
      (2) pdflush writes out new inode_A to the filesystem, but it results
          in write I/O error, at this point, BH_Uptodate flag of the buffer
          for block_B is cleared and BH_Write_EIO is set
      (3) create new inode_C, which is located at block_B, and
          __ext3_get_inode_loc() tries to read on-disk block_B because the
          buffer is not uptodate
      (4) if it can read on-disk block_B successfully, inode_A is
          overwritten by old data
      
      This patch makes __ext3_get_inode_loc() not read the inode block if the
      buffer has the BH_Write_EIO flag.  In this case, the buffer should have the
      latest information, so we set the uptodate flag on the buffer (this
      avoids WARN_ON_ONCE() in mark_buffer_dirty()).
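The decision described above can be sketched as a small userspace model. It is an illustration only, not the ext3 code: `struct model_buffer` and `model_get_inode_block` are hypothetical, and the two booleans stand in for the BH_Uptodate and BH_Write_EIO buffer flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a buffer head holding an inode block. */
struct model_buffer {
	bool uptodate;		/* models BH_Uptodate */
	bool write_eio;		/* models BH_Write_EIO */
	char data[16];		/* in-memory copy of the block */
};

/*
 * Make the buffer usable for an inode update.  If an earlier write of this
 * buffer failed, the in-memory copy is newer than the on-disk block, so it
 * is marked uptodate instead of being re-read; re-reading would overwrite
 * the newer inode data with stale data from disk.
 * Returns true if the block was read from `on_disk`.
 */
static bool model_get_inode_block(struct model_buffer *bh, const char *on_disk)
{
	assert(bh != NULL && on_disk != NULL);

	if (bh->uptodate)
		return false;		/* nothing to do */

	if (bh->write_eio) {
		bh->uptodate = true;	/* keep the newer in-memory data */
		return false;
	}

	/* Ordinary case: the buffer is stale, so read the on-disk block. */
	strncpy(bh->data, on_disk, sizeof(bh->data) - 1);
	bh->data[sizeof(bh->data) - 1] = '\0';
	bh->uptodate = true;
	return true;
}
```

In the scenario from the commit message, step (3) corresponds to calling this function on a buffer with `uptodate == false` and `write_eio == true`: the model leaves the data alone, so the update to inode_A survives.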
      
      With this change, we would need to test the BH_Write_EIO flag for
      error checking.  Currently nobody checks write I/O errors on metadata
      buffers, but it will be done in other patches I'm working on.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: sugita <yumiko.sugita.yf@hitachi.com>
      Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext3: handle corrupted orphan list at mount · ae76dd9a
      Committed by Duane Griffin
      If the orphan node list includes valid, untruncatable nodes with nlink > 0
      the ext3_orphan_cleanup loop which attempts to delete them will not do so,
      causing it to loop forever. Fix by checking for such nodes in the
      ext3_orphan_get function.
      
      This patch fixes the second case (image hdb.20000009.softlockup.gz)
      reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: printk warning fix]
      Signed-off-by: Duane Griffin <duaneg@dghda.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 30 Apr, 2008 1 commit
  21. 28 Apr, 2008 2 commits
  22. 22 Apr, 2008 1 commit
  23. 08 Feb, 2008 1 commit
  24. 07 Feb, 2008 2 commits
  25. 06 Feb, 2008 1 commit
    • Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3
      Committed by Christoph Lameter
      Simplify page cache zeroing of segments of pages through 3 functions:
      
      zero_user_segments(page, start1, end1, start2, end2)
      
              Zeros two segments of the page. It takes the positions where to
              start and end the zeroing, which avoids length calculations and
              makes the code clearer.
      
      zero_user_segment(page, start, end)
      
              Same for a single segment.
      
      zero_user(page, start, length)
      
              Length variant for the case where we know the length.
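How the three calling conventions relate can be shown with a userspace model. This is a sketch only: it operates on a plain byte buffer, does no kmap or cache flushing, and the `model_` names and `MODEL_PAGE_SIZE` are assumptions made for illustration. End positions are treated as exclusive here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096	/* assumed page size for this model */

/* Zero bytes [start1, end1) and [start2, end2) of a page-sized buffer. */
static void model_zero_user_segments(unsigned char *page,
				     size_t start1, size_t end1,
				     size_t start2, size_t end2)
{
	assert(page != NULL);
	assert(end1 <= MODEL_PAGE_SIZE && end2 <= MODEL_PAGE_SIZE);

	if (end1 > start1)
		memset(page + start1, 0, end1 - start1);
	if (end2 > start2)
		memset(page + start2, 0, end2 - start2);
}

/* Single-segment variant: the second segment is empty. */
static void model_zero_user_segment(unsigned char *page,
				    size_t start, size_t end)
{
	model_zero_user_segments(page, start, end, 0, 0);
}

/* Length variant: converts (start, length) into (start, end). */
static void model_zero_user(unsigned char *page, size_t start, size_t length)
{
	model_zero_user_segment(page, start, start + length);
}
```

The two-segment form is the general one; a typical use is zeroing the parts of a page before and after a region that is about to be written.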
      
      We remove the zero_user_page macro. Issues:
      
      1. It's a macro. Inline functions are preferable.
      
      2. The KM_USER0 macro is only defined for HIGHMEM.
      
         Having to treat this special case everywhere makes the
         code needlessly complex. The parameter for zeroing is always
         KM_USER0 except in one single case that we open code.
      
      Avoiding KM_USER0 means a lot of code no longer has to deal
      with the special casing for HIGHMEM. Dealing with
      kmap is only necessary for HIGHMEM configurations. In those
      configurations we use KM_USER0 like we do for a series of other
      functions defined in highmem.h.
      
      Since KM_USER0 depends on HIGHMEM, the existing zero_user_page
      had to be a macro rather than an inline function. The zero_user_*
      functions introduced here can be inline because that constant is
      not used when these functions are called.
      
      Also extract the flushing of the caches to be outside of the kmap.
      
      [akpm@linux-foundation.org: fix nfs and ntfs build]
      [akpm@linux-foundation.org: fix ntfs build some more]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: David Chinner <dgc@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 20 Oct, 2007 1 commit
  27. 19 Oct, 2007 1 commit
  28. 17 Oct, 2007 1 commit
  29. 17 Jul, 2007 1 commit
  30. 24 Jun, 2007 1 commit