1. 17 3月, 2009 1 次提交
    • E
      ext4: fix bb_prealloc_list corruption due to wrong group locking · d33a1976
      Eric Sandeen 提交于
      This is for Red Hat bug 490026: EXT4 panic, list corruption in
      ext4_mb_new_inode_pa
      
      ext4_lock_group(sb, group) is supposed to protect this list for
      each group, and a common code flow to remove an album is like
      this:
      
          ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
          ext4_lock_group(sb, grp);
          list_del(&pa->pa_group_list);
          ext4_unlock_group(sb, grp);
      
      so it's critical that we get the right group number back for
      this prealloc context, to lock the right group (the one 
      associated with this pa) and prevent concurrent list manipulation.
      
      however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a 
      comment, "-1 is to protect from crossing allocation group".
      
      This makes sense for the group_pa, where pa_pstart is advanced
      by the length which has been used (in ext4_mb_release_context()),
      and when the entire length has been used, pa_pstart has been
      advanced to the first block of the next group.
      
      However, for inode_pa, pa_pstart is never advanced; it's just
      set once to the first block in the group and not moved after
      that.  So in this case, if we subtract one in ext4_mb_put_pa(),
      we are actually locking the *previous* group, and opening the
      race with the other threads which do not subtract off the extra
      block.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d33a1976
  2. 14 3月, 2009 1 次提交
    • E
      ext4: fix bogus BUG_ONs in in mballoc code · 8d03c7a0
      Eric Sandeen 提交于
      Thiemo Nagel reported that:
      
      # dd if=/dev/zero of=image.ext4 bs=1M count=2
      # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
        -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
      # mount -o loop image.ext4 mnt/
      # dd if=/dev/zero of=mnt/file
      
      oopsed, with a BUG_ON in ext4_mb_normalize_request because
      size == EXT4_BLOCKS_PER_GROUP
      
      It appears to me (esp. after talking to Andreas) that the BUG_ON
      is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
      be allowed, though larger sizes do indicate a problem.
      
      Fix that an another (apparently rare) codepath with a similar check.
      Reported-by: NThiemo Nagel <thiemo.nagel@ph.tum.de>
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      8d03c7a0
  3. 13 3月, 2009 1 次提交
  4. 11 3月, 2009 1 次提交
    • E
      ext4: fix header check in ext4_ext_search_right() for deep extent trees. · 395a87bf
      Eric Sandeen 提交于
      The ext4_ext_search_right() function is confusing; it uses a
      "depth" variable which is 0 at the root and maximum at the leaves, 
      but the on-disk metadata uses a "depth" (actually eh_depth) which
      is opposite: maximum at the root, and 0 at the leaves.
      
      The ext4_ext_check_header() function is given a depth and checks
      the header agaisnt that depth; it expects the on-disk semantics,
      but we are giving it the opposite in the while loop in this 
      function.  We should be giving it the on-disk notion of "depth"
      which we can get from (p_depth - depth) - and if you look, the last
      (more commonly hit) call to ext4_ext_check_header() does just this.
      
      Sending in the wrong depth results in (incorrect) messages
      about corruption:
      
      EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
      in inode #2621457: unexpected eh_depth - magic f30a, entries 340,
      max 340(0), depth 1(2)
      
      http://bugzilla.kernel.org/show_bug.cgi?id=12821Reported-by: NDavid Dindorp <ddi@dubex.dk>
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      395a87bf
  5. 05 3月, 2009 1 次提交
    • E
      ext4: fix ext4_free_inode() vs. ext4_claim_inode() race · 7ce9d5d1
      Eric Sandeen 提交于
      I was seeing fsck errors on inode bitmaps after a 4 thread
      dbench run on a 4 cpu machine:
      
      Inode bitmap differences: -50736 -(50752--50753) etc...
      
      I believe that this is because ext4_free_inode() uses atomic
      bitops, and although ext4_new_inode() *used* to also use atomic 
      bitops for synchronization, commit 
      39341867 changed this to use
      the sb_bgl_lock, so that we could also synchronize against
      read_inode_bitmap and initialization of uninit inode tables.
      
      However, that change left ext4_free_inode using atomic bitops,
      which I think leaves no synchronization between setting & 
      unsetting bits in the inode table.
      
      The below patch fixes it for me, although I wonder if we're 
      getting at all heavy-handed with this spinlock...
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      7ce9d5d1
  6. 26 2月, 2009 1 次提交
  7. 28 2月, 2009 1 次提交
  8. 23 2月, 2009 1 次提交
  9. 22 2月, 2009 1 次提交
    • T
      ext4: Add fallback for find_group_flex · 05bf9e83
      Theodore Ts'o 提交于
      This is a workaround for find_group_flex() which badly needs to be
      replaced.  One of its problems (besides ignoring the Orlov algorithm)
      is that it is a bit hyperactive about returning failure under
      suspicious circumstances.  This can lead to spurious ENOSPC failures
      even when there are inodes still available.
      
      Work around this for now by retrying the search using
      find_group_other() if find_group_flex() returns -1.  If
      find_group_other() succeeds when find_group_flex() has failed, log a
      warning message.
      
      A better block/inode allocator that will fix this problem for real has
      been queued up for the next merge window.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      05bf9e83
  10. 16 2月, 2009 1 次提交
  11. 14 2月, 2009 2 次提交
  12. 11 2月, 2009 1 次提交
  13. 10 2月, 2009 1 次提交
    • W
      ext4: Fix to read empty directory blocks correctly in 64k · 7be2baaa
      Wei Yongjun 提交于
      The rec_len field in the directory entry is 16 bits, so there was a
      problem representing rec_len for filesystems with a 64k block size in
      the case where the directory entry takes the entire 64k block.
      Unfortunately, there were two schemes that were proposed; one where
      all zeros meant 65536 and one where all ones (65535) meant 65536.
      E2fsprogs used 0, whereas the kernel used 65535.  Oops.  Fortunately
      this case happens extremely rarely, with the most common case being
      the lost+found directory, created by mke2fs.
      
      So we will be liberal in what we accept, and accept both encodings,
      but we will continue to encode 65536 as 65535.  This will require a
      change in e2fsprogs, but with fortunately ext4 filesystems normally
      have the dir_index feature enabled, which precludes having a
      completely empty directory block.
      Signed-off-by: NWei Yongjun <yjwei@cn.fujitsu.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      7be2baaa
  14. 11 2月, 2009 1 次提交
    • J
      jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate() · 7f5aa215
      Jan Kara 提交于
      If we race with commit code setting i_transaction to NULL, we could
      possibly dereference it.  Proper locking requires the journal pointer
      (to access journal->j_list_lock), which we don't have.  So we have to
      change the prototype of the function so that filesystem passes us the
      journal pointer.  Also add a more detailed comment about why the
      function jbd2_journal_begin_ordered_truncate() does what it does and
      how it should be used.
      
      Thanks to Dan Carpenter <error27@gmail.com> for pointing to the
      suspitious code.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Acked-by: NJoel Becker <joel.becker@oracle.com>
      CC: linux-ext4@vger.kernel.org
      CC: ocfs2-devel@oss.oracle.com
      CC: mfasheh@suse.de
      CC: Dan Carpenter <error27@gmail.com>
      7f5aa215
  15. 10 2月, 2009 1 次提交
  16. 30 1月, 2009 1 次提交
    • T
      ext4: Remove bogus BUG() check in ext4_bmap() · b9ec63f7
      Theodore Ts'o 提交于
      The code to support journal-less ext4 operation added a BUG to
      ext4_bmap() which fired if there was no journal and the
      EXT4_STATE_JDATA bit was set in the i_state field.  This caused
      running the filefrag program (which uses the FIMBAP ioctl) to trigger
      a BUG().
      
      The EXT4_STATE_JDATA bit is only used for ext4_bmap(), and it's
      harmless for the bit to be set.  We could add a check in
      __ext4_journalled_writepage() and ext4_journalled_write_end() to only
      set the EXT4_STATE_JDATA bit if the journal is present, but that adds
      an extra test and jump instruction.  It's easier to simply remove the
      BUG check.
      
      http://bugzilla.kernel.org/show_bug.cgi?id=12568Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
      b9ec63f7
  17. 27 1月, 2009 2 次提交
  18. 20 1月, 2009 1 次提交
  19. 17 1月, 2009 1 次提交
  20. 18 1月, 2009 1 次提交
    • T
      ext4: only use i_size_high for regular files · 06a279d6
      Theodore Ts'o 提交于
      Directories are not allowed to be bigger than 2GB, so don't use
      i_size_high for anything other than regular files.  E2fsck should
      complain about these inodes, but the simplest thing to do for the
      kernel is to only use i_size_high for regular files.
      
      This prevents an intentially corrupted filesystem from causing the
      kernel to burn a huge amount of CPU and issuing error messages such
      as:
      
      EXT4-fs warning (device loop0): ext4_block_to_path: block 135090028 > max
      
      Thanks to David Maciejak from Fortinet's FortiGuard Global Security
      Research Team for reporting this issue.
      
      http://bugzilla.kernel.org/show_bug.cgi?id=12375Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
      06a279d6
  21. 10 1月, 2009 1 次提交
    • T
      filesystem freeze: add error handling of write_super_lockfs/unlockfs · c4be0c1d
      Takashi Sato 提交于
      Currently, ext3 in mainline Linux doesn't have the freeze feature which
      suspends write requests.  So, we cannot take a backup which keeps the
      filesystem's consistency with the storage device's features (snapshot and
      replication) while it is mounted.
      
      In many case, a commercial filesystem (e.g.  VxFS) has the freeze feature
      and it would be used to get the consistent backup.
      
      If Linux's standard filesystem ext3 has the freeze feature, we can do it
      without a commercial filesystem.
      
      So I have implemented the ioctls of the freeze feature.
      I think we can take the consistent backup with the following steps.
      1. Freeze the filesystem with the freeze ioctl.
      2. Separate the replication volume or create the snapshot
         with the storage device's feature.
      3. Unfreeze the filesystem with the unfreeze ioctl.
      4. Take the backup from the separated replication volume
         or the snapshot.
      
      This patch:
      
      VFS:
      Changed the type of write_super_lockfs and unlockfs from "void"
      to "int" so that they can return an error.
      Rename write_super_lockfs and unlockfs of the super block operation
      freeze_fs and unfreeze_fs to avoid a confusion.
      
      ext3, ext4, xfs, gfs2, jfs:
      Changed the type of write_super_lockfs and unlockfs from "void"
      to "int" so that write_super_lockfs returns an error if needed,
      and unlockfs always returns 0.
      
      reiserfs:
      Changed the type of write_super_lockfs and unlockfs from "void"
      to "int" so that they always return 0 (success) to keep a current behavior.
      Signed-off-by: NTakashi Sato <t-sato@yk.jp.nec.com>
      Signed-off-by: NMasayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
      Cc: <xfs-masters@oss.sgi.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4be0c1d
  22. 09 1月, 2009 2 次提交
  23. 07 1月, 2009 2 次提交
  24. 06 1月, 2009 1 次提交
  25. 07 1月, 2009 1 次提交
  26. 06 1月, 2009 3 次提交
  27. 05 1月, 2009 2 次提交
    • N
      fs: symlink write_begin allocation context fix · 54566b2c
      Nick Piggin 提交于
      With the write_begin/write_end aops, page_symlink was broken because it
      could no longer pass a GFP_NOFS type mask into the point where the
      allocations happened.  They are done in write_begin, which would always
      assume that the filesystem can be entered from reclaim.  This bug could
      cause filesystem deadlocks.
      
      The funny thing with having a gfp_t mask there is that it doesn't really
      allow the caller to arbitrarily tinker with the context in which it can be
      called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
      take the page lock.  The only thing any callers care about is __GFP_FS
      anyway, so turn that into a single flag.
      
      Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
      this flag in their write_begin function.  Change __grab_cache_page to
      accept a nofs argument as well, to honour that flag (while we're there,
      change the name to grab_cache_page_write_begin which is more instructive
      and does away with random leading underscores).
      
      This is really a more flexible way to go in the end anyway -- if a
      filesystem happens to want any extra allocations aside from the pagecache
      ones in ints write_begin function, it may now use GFP_KERNEL (rather than
      GFP_NOFS) for common case allocations (eg.  ocfs2_alloc_write_ctxt, for a
      random example).
      
      [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
      [kosaki.motohiro@jp.fujitsu.com: fix fuse]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Cleaned up the calling convention: just pass in the AOP flags
        untouched to the grab_cache_page_write_begin() function.  That
        just simplifies everybody, and may even allow future expansion of the
        logic.   - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54566b2c
    • P
      fs: introduce bgl_lock_ptr() · c644f0e4
      Pekka Enberg 提交于
      As suggested by Andreas Dilger, introduce a bgl_lock_ptr() helper in
      <linux/blockgroup_lock.h> and add separate sb_bgl_lock() helpers to
      filesystem specific header files to break the hidden dependency to
      struct ext[234]_sb_info.
      
      Also, while at it, convert the macros to static inlines to try make up
      for all the times I broke Andrew Morton's tree.
      Acked-by: NAndreas Dilger <adilger@sun.com>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c644f0e4
  28. 04 1月, 2009 1 次提交
  29. 07 1月, 2009 1 次提交
  30. 06 1月, 2009 4 次提交