1. 01 May, 2009 1 commit
    • ext4: Avoid races caused by on-line resizing and SMP memory reordering · 8df9675f
      Theodore Ts'o committed
      Ext4's on-line resizing adds a new block group and then, only at the
      last step, adjusts s_groups_count.  However, it's possible on SMP
      systems that another CPU could see the updated s_groups_count and
      not see the newly initialized data structures for the just-added block
      group.  For this reason, it's important to insert an SMP read barrier
      after reading s_groups_count and before reading, for example, any of
      the new block group descriptors allowed by the increased value of
      s_groups_count.
      
      Unfortunately, we rather blatantly violate this locking protocol,
      which is documented in fs/ext4/resize.c.  Fortunately, (1) on-line
      resizes happen relatively rarely, and (2) it seems rare that the
      filesystem code will immediately try to use a just-added block group
      before any memory ordering issues resolve themselves.  So apparently
      problems here are relatively hard to hit, since ext3 has been
      vulnerable to the same issue for years with no one apparently
      complaining.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
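      The ordering requirement described in this commit message can be
      sketched in userspace with C11 atomics.  This is only an analogue: the
      kernel uses its own barrier primitives rather than <stdatomic.h>, and
      struct fs_state, add_group(), sum_groups() and MAX_GROUPS are
      hypothetical names invented for the illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

#define MAX_GROUPS 8   /* hypothetical capacity for this sketch */

struct fs_state {
    uint32_t group_desc[MAX_GROUPS];   /* stand-in for block group descriptors */
    atomic_uint groups_count;          /* stand-in for s_groups_count */
};

/* Resizer: initialise the new group first, publish the count last. */
static int add_group(struct fs_state *fs, uint32_t desc)
{
    unsigned n = atomic_load_explicit(&fs->groups_count, memory_order_relaxed);

    if (n >= MAX_GROUPS)
        return -1;
    fs->group_desc[n] = desc;
    /* Release: the descriptor write is ordered before the count update. */
    atomic_store_explicit(&fs->groups_count, n + 1, memory_order_release);
    return 0;
}

/* Reader: load the count with acquire semantics, then read descriptors. */
static uint64_t sum_groups(struct fs_state *fs)
{
    unsigned n = atomic_load_explicit(&fs->groups_count, memory_order_acquire);
    uint64_t sum = 0;

    for (unsigned i = 0; i < n; i++)
        sum += fs->group_desc[i];
    return sum;
}
```

      The acquire load plays the role of the read barrier after reading the
      count: a reader that observes the larger count is guaranteed to also
      observe the descriptor written before the release store.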
  2. 02 May, 2009 1 commit
  3. 01 May, 2009 2 commits
    • ext4: Fix and simplify s_dirt handling · 7234ab2a
      Theodore Ts'o committed
      The s_dirt flag wasn't completely handled correctly, but it didn't
      really matter when journalling was enabled.  It turns out that when
      ext4 runs without a journal, we don't clear s_dirt in places where we
      should have, with the result that the high-level write_super()
      function was writing the superblock when it wasn't necessary.
      
      So we fix this by making ext4_commit_super() clear the s_dirt flag,
      and removing many of the other places where s_dirt is manipulated.
      When journalling is enabled, the s_dirt flag might be left set more
      often, but s_dirt really doesn't matter when journalling is enabled.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Simplify ext4_commit_super()'s function signature · e2d67052
      Theodore Ts'o committed
      The ext4_commit_super() function took both a struct super_block * and
      a struct ext4_super_block *, but the struct ext4_super_block can be
      derived from the struct super_block.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  4. 25 April, 2009 1 commit
  5. 28 April, 2009 1 commit
    • ext4: Fallback to vmalloc if kmalloc can't allocate s_flex_groups array · c5ca7c76
      Theodore Ts'o committed
      For very large filesystems, the s_flex_groups array can get quite big.
      For example, a filesystem that can be resized up to 16TB will have
      8192 flex groups (assuming the default flex_bg size of 16), so the
      array is 96k, which is *very* marginal for kmalloc().  On the other
      hand, a 160GB filesystem without the resize_inode feature will only
      require 960 bytes.  So we try to allocate the array first using
      kmalloc(), and if that fails, we'll try to use vmalloc() instead.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
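      The sizes quoted in this commit message can be reproduced with a little
      arithmetic.  The sketch below assumes 4 KiB blocks and 32768 blocks per
      group, and infers a 12-byte array entry from the quoted figures (96k
      divided by 8192 entries); none of these three values is stated in the
      message itself, and the function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_BLOCK_SIZE        4096u    /* assumed 4 KiB blocks */
#define SKETCH_BLOCKS_PER_GROUP  32768u   /* assumed blocks per group */
#define SKETCH_FLEX_BG_SIZE      16u      /* default flex_bg size quoted above */
#define SKETCH_FLEX_ENTRY_BYTES  12u      /* inferred: 96k / 8192 entries */

/* Number of flex groups needed to cover a filesystem of fs_bytes bytes. */
static uint64_t flex_group_count(uint64_t fs_bytes)
{
    uint64_t blocks = fs_bytes / SKETCH_BLOCK_SIZE;
    uint64_t groups = (blocks + SKETCH_BLOCKS_PER_GROUP - 1) / SKETCH_BLOCKS_PER_GROUP;

    return (groups + SKETCH_FLEX_BG_SIZE - 1) / SKETCH_FLEX_BG_SIZE;
}

/* Size in bytes of the per-flex-group array for that filesystem. */
static uint64_t flex_array_bytes(uint64_t fs_bytes)
{
    return flex_group_count(fs_bytes) * SKETCH_FLEX_ENTRY_BYTES;
}
```

      Under these assumptions, flex_array_bytes(16ULL << 40) gives 98304
      bytes (96k) across 8192 flex groups, and flex_array_bytes(160ULL << 30)
      gives 960 bytes, matching the two cases in the message.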
  6. 13 May, 2009 1 commit
    • ext4: Mark the unwritten buffer_head as mapped during write_begin · 29fa89d0
      Aneesh Kumar K.V committed
      Setting BH_Unwritten buffer_heads as BH_Mapped avoids multiple
      (unnecessary) calls to get_block() during the call to the write(2)
      system call.  Setting BH_Unwritten buffer heads as BH_Mapped requires
      that the writepages() functions can handle BH_Unwritten buffer_heads.
      
      After this commit, things work as follows:
      
      ext4_ext_get_block() returns an unmapped, unwritten buffer head when
      called with create = 0 for prealloc space. This makes sure we handle
      the read path and non-delayed allocation case correctly.  Even though
      the buffer head is marked unmapped we have valid b_blocknr and b_bdev
      values in the buffer_head.
      
      ext4_da_get_block_prep() called for block reservation will now return
      a mapped, unwritten, new buffer_head for prealloc space. This avoids
      multiple calls to get_block() for writes to the same offset. By marking
      such buffers as BH_New, we also ensure that sub-block zeroing of
      buffered writes happens correctly.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  7. 14 May, 2009 1 commit
  8. 15 May, 2009 2 commits
    • ext4: Fix race in ext4_inode_info.i_cached_extent · 2ec0ae3a
      Theodore Ts'o committed
      If two CPUs call ext4_ext_get_blocks() at the same
      time, there is nothing protecting the i_cached_extent structure from
      being used and updated at the same time.  This could potentially cause
      the wrong location on disk to be read or written to, including
      potentially causing the corruption of the block group descriptors
      and/or inode table.
      
      This bug has been in the ext4 code since almost the very beginning of
      ext4's development.  Fortunately once the data is stored in the page
      cache, ext4_get_blocks() doesn't need to be called, so trying to
      replicate this problem to the point where we could identify its root
      cause was *extremely* difficult.  Many thanks to Kevin Shanahan for
      working over several months to be able to reproduce this easily so we
      could finally nail down the cause of the corruption.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    • ext4: Clear the unwritten buffer_head flag after the extent is initialized · 2a8964d6
      Aneesh Kumar K.V committed
      The BH_Unwritten flag indicates that the buffer is allocated on disk
      but has not been written; that is, the disk block was part of a
      persistent preallocation area.  That flag should only be set when a
      get_blocks() function is looking up an inode's logical to physical
      block mapping.
      
      When ext4_get_blocks_wrap() is called with create=1, the uninitialized
      extent is converted into an initialized one, so the BH_Unwritten flag
      is no longer appropriate.  Hence, we need to make sure that
      BH_Unwritten is not left set, since the combination of BH_Mapped and
      BH_Unwritten is not allowed; among other things, it will cause ext4's
      get_block() to be called over and over again during the write_begin
      phase of write(2).
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
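      The flag rule in this commit message can be modelled with a plain
      bitmask.  This is a simplified sketch, not the kernel's buffer_head
      code: the SK_* values, struct sketch_bh and sketch_get_blocks() are
      hypothetical stand-ins for illustration.

```c
#include <stdbool.h>

/* Hypothetical bit values chosen for this sketch only. */
enum {
    SK_MAPPED    = 1u << 0,
    SK_UNWRITTEN = 1u << 1,
};

struct sketch_bh {
    unsigned int state;
};

/* Model of a block lookup on a buffer head that may carry stale flags. */
static void sketch_get_blocks(struct sketch_bh *bh, bool prealloc, bool create)
{
    if (prealloc && !create) {
        /* Pure lookup of preallocated space: report it as unwritten. */
        bh->state |= SK_UNWRITTEN;
        return;
    }
    /* Ordinary extent, or create=1 has just initialised the extent:
     * mark it mapped and make sure no stale unwritten bit survives. */
    bh->state |= SK_MAPPED;
    bh->state &= ~(unsigned int)SK_UNWRITTEN;
}
```

      Calling sketch_get_blocks() first with create = false and then with
      create = true on the same buffer leaves only the mapped bit set, which
      is the state the message asks for.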
  9. 13 May, 2009 1 commit
  10. 14 May, 2009 1 commit
  11. 25 April, 2009 3 commits
    • ext4: Do not try to validate extents on special files · c4b5a614
      Theodore Ts'o committed
      The EXTENTS_FL flag should never be set on special files, but if it
      is, don't bother trying to validate that the extents tree is valid,
      since only files, directories, and non-fast symlinks will ever have an
      extent data structure.  We perhaps should flag the filesystem as being
      corrupted if we see a special file (named pipes, device nodes, Unix
      domain sockets, etc.) with the EXTENTS_FL flag, but e2fsck doesn't
      currently check this case, so we'll just ignore this for now, since
      it's harmless.
      
      Without this fix, a special device with the extents flag is flagged as
      an error by the kernel, so it is impossible to access or delete the
      inode, but e2fsck doesn't see it as a problem, leading to
      confused/frustrated users.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Ignore i_file_acl_high unless EXT4_FEATURE_INCOMPAT_64BIT is present · a9e81742
      Theodore Ts'o committed
      Don't try to look at i_file_acl_high unless the INCOMPAT_64BIT feature
      bit is set.  The field is normally zero, but older versions of e2fsck
      didn't automatically check to make sure of this, so in the spirit of
      "be liberal in what you accept", don't look at i_file_acl_high unless
      we are using a 64-bit filesystem.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Fix softlockup caused by illegal i_file_acl value in on-disk inode · 485c26ec
      Theodore Ts'o committed
      If the block number of the external extended attribute block (which
      is stored in i_file_acl and i_file_acl_high) lies beyond the end of
      the on-disk filesystem, the process which tried to access the
      extended attributes will endlessly issue kernel printks complaining
      that "__find_get_block_slow() failed", locking up that CPU until the
      system is forcibly rebooted.
      
      So when we read in the inode, make sure the i_file_acl value is legal,
      and if not, flag the filesystem as being corrupted.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
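      The two i_file_acl commits above fit together as one small routine:
      combine the low and high halves only on a 64-bit filesystem, then
      range-check the result.  A minimal sketch, assuming a 32-bit low field
      and a 16-bit high field; struct sketch_raw_inode and both function
      names are hypothetical, and the exact bounds the kernel enforces are
      not given in the messages.

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_raw_inode {
    uint32_t file_acl_lo;    /* stand-in for i_file_acl */
    uint16_t file_acl_high;  /* stand-in for i_file_acl_high */
};

/* Only consult the high bits when the 64-bit feature is enabled. */
static uint64_t sketch_file_acl(const struct sketch_raw_inode *ri, bool has_64bit)
{
    uint64_t blk = ri->file_acl_lo;

    if (has_64bit)
        blk |= (uint64_t)ri->file_acl_high << 32;
    return blk;
}

/* Zero means "no external xattr block"; otherwise the block must lie
 * inside the filesystem (assumed bound: blk < blocks_count). */
static bool sketch_file_acl_is_legal(uint64_t blk, uint64_t blocks_count)
{
    return blk == 0 || blk < blocks_count;
}
```

      With a stray high half of 1 and a low half of 0x1000, the combined
      value is 0x1000 without the 64-bit feature and 0x100001000 with it,
      which is why ignoring the high field on non-64-bit filesystems is the
      more forgiving choice.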
  12. 23 April, 2009 2 commits
    • ext4: Fix potential inode allocation soft lockup in Orlov allocator · b5451f7b
      Theodore Ts'o committed
      If the Orlov allocator is having trouble finding an appropriate block
      group, the fallback code could loop forever, causing a soft lockup
      warning in find_group_orlov():
      
      BUG: soft lockup - CPU#0 stuck for 61s! [cp:11728]
           ...
      Pid: 11728, comm: cp Not tainted (2.6.30-rc1-dirty #77) Lenovo          
      EIP: 0060:[<c021650e>] EFLAGS: 00000246 CPU: 0
      EIP is at ext4_get_group_desc+0x54/0x9d
          ...
      Call Trace:
       [<c0218021>] find_group_orlov+0x2ee/0x334
       [<c0120a5f>] ? sched_clock+0x8/0xb
       [<c02188e3>] ext4_new_inode+0x2cf/0xb1a
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Make the extent validity check more paranoid · e84a26ce
      Theodore Ts'o committed
      Instead of just checking that the extent block number is greater than
      or equal to s_first_data_block, make sure it is not pointing into
      the block group descriptors, since that is clearly wrong.  This helps
      prevent the filesystem from getting very badly corrupted in case an
      extent block is corrupted.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
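      The difference between the old and the stricter check can be sketched
      as follows.  The layout is an assumption made for the illustration
      (group descriptors occupying the gdb_count blocks immediately after
      s_first_data_block), and struct sketch_sb and both function names are
      hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_sb {
    uint64_t first_data_block;  /* stand-in for s_first_data_block */
    uint64_t gdb_count;         /* number of group descriptor blocks */
};

/* Old check: anything at or after the first data block passes. */
static bool extent_start_ok_old(const struct sketch_sb *sb, uint64_t blk)
{
    return blk >= sb->first_data_block;
}

/* Stricter check: also reject blocks that fall in the (assumed)
 * descriptor area directly after first_data_block. */
static bool extent_start_ok(const struct sketch_sb *sb, uint64_t blk)
{
    return blk > sb->first_data_block + sb->gdb_count;
}
```

      With first_data_block = 0 and four descriptor blocks, block 2 passes
      the old check but is rejected by the stricter one, while block 5
      passes both.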
  13. 15 April, 2009 1 commit
  14. 14 April, 2009 1 commit
  15. 08 April, 2009 1 commit
  16. 05 April, 2009 1 commit
  17. 08 April, 2009 1 commit
  18. 01 April, 2009 2 commits
    • mm: page_mkwrite change prototype to match fault · c2ec175c
      Nick Piggin committed
      Change the page_mkwrite prototype to take a struct vm_fault, and return
      VM_FAULT_xxx flags.  There should be no functional change.
      
      This makes it possible to return much more detailed error information to
      the VM (and can also provide more information, e.g. virtual_address, to the
      driver, which might be important in some special cases).
      
      This is required for a subsequent fix, and will also make it easier to
      merge page_mkwrite() with fault() in the future.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Cc: Felix Blyakher <felixb@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • New helper - current_umask() · ce3b0f8d
      Al Viro committed
      current->fs->umask is what most of fs_struct users are doing.
      Put that into a helper function.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  19. 30 March, 2009 1 commit
  20. 28 March, 2009 4 commits
  21. 31 March, 2009 1 commit
  22. 26 March, 2009 4 commits
  23. 17 March, 2009 2 commits
    • ext4: fix bb_prealloc_list corruption due to wrong group locking · d33a1976
      Eric Sandeen committed
      This is for Red Hat bug 490026: EXT4 panic, list corruption in
      ext4_mb_new_inode_pa
      
      ext4_lock_group(sb, group) is supposed to protect this list for
      each group, and a common code flow to remove a prealloc context
      (pa) from it is like this:
      
          ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
          ext4_lock_group(sb, grp);
          list_del(&pa->pa_group_list);
          ext4_unlock_group(sb, grp);
      
      so it's critical that we get the right group number back for
      this prealloc context, to lock the right group (the one 
      associated with this pa) and prevent concurrent list manipulation.
      
      However, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a
      comment, "-1 is to protect from crossing allocation group".
      
      This makes sense for the group_pa, where pa_pstart is advanced
      by the length which has been used (in ext4_mb_release_context()),
      and when the entire length has been used, pa_pstart has been
      advanced to the first block of the next group.
      
      However, for inode_pa, pa_pstart is never advanced; it's just
      set once to the first block in the group and not moved after
      that.  So in this case, if we subtract one in ext4_mb_put_pa(),
      we are actually locking the *previous* group, and opening the
      race with the other threads which do not subtract off the extra
      block.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
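      The off-by-one described in this commit message is easy to see with a
      simplified group-number calculation.  sketch_group_of() below is a
      hypothetical stand-in for the block-to-group mapping, not the kernel's
      ext4_get_group_no_and_offset(), and the block numbers are made up for
      the example.

```c
#include <stdint.h>

/* Map a physical block number to the group that contains it. */
static uint32_t sketch_group_of(uint64_t block, uint64_t first_data_block,
                                uint32_t blocks_per_group)
{
    return (uint32_t)((block - first_data_block) / blocks_per_group);
}
```

      Taking 32768 blocks per group and a first data block of 0: an inode_pa
      whose pa_pstart is the first block of group 3 (block 98304) maps to
      group 3, but pa_pstart - 1 maps to group 2, so the wrong group gets
      locked.  A fully consumed group_pa that started in group 3 has had
      pa_pstart advanced to block 131072, the first block of group 4, and
      there pa_pstart - 1 correctly maps back to group 3.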
    • ext4: Add auto_da_alloc mount option · afd4672d
      Theodore Ts'o committed
      Add a mount option which allows the user to disable the automatic
      allocation of blocks that delayed allocation had deferred, which
      otherwise happens when the file was originally truncated or when the
      file is renamed over an existing file.  This feature is intended to
      save users from the effects of naive application writers, but it
      reduces the effectiveness of the delayed allocation code.  This mount
      option disables this safety feature, which may be desirable for
      production systems where the risk of unclean shutdowns or unexpected
      system crashes is low.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  24. 14 March, 2009 1 commit
    • ext4: fix bogus BUG_ONs in mballoc code · 8d03c7a0
      Eric Sandeen committed
      Thiemo Nagel reported that:
      
      # dd if=/dev/zero of=image.ext4 bs=1M count=2
      # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
        -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
      # mount -o loop image.ext4 mnt/
      # dd if=/dev/zero of=mnt/file
      
      oopsed, with a BUG_ON in ext4_mb_normalize_request because
      size == EXT4_BLOCKS_PER_GROUP
      
      It appears to me (esp. after talking to Andreas) that the BUG_ON
      is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
      be allowed, though larger sizes do indicate a problem.
      
      Fix that and another (apparently rare) codepath with a similar check.
      Reported-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
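      The boundary condition at issue can be written as two predicates.
      That the old assertion used an inclusive comparison is inferred from
      the report (the oops fired when size == EXT4_BLOCKS_PER_GROUP); the
      function names are hypothetical and only model when the assertion
      would trigger.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inferred old behaviour: a request of exactly one group trips the check. */
static bool request_trips_bug_old(uint32_t size, uint32_t blocks_per_group)
{
    return size >= blocks_per_group;
}

/* Relaxed behaviour: only requests larger than a group are treated as bugs. */
static bool request_trips_bug(uint32_t size, uint32_t blocks_per_group)
{
    return size > blocks_per_group;
}
```

      With the 512 blocks per group from the reproducer (-g 512), a request
      of 512 blocks trips the old predicate but not the relaxed one, while
      a request of 513 blocks trips both.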
  25. 13 March, 2009 1 commit
  26. 11 March, 2009 1 commit
    • ext4: fix header check in ext4_ext_search_right() for deep extent trees. · 395a87bf
      Eric Sandeen committed
      The ext4_ext_search_right() function is confusing; it uses a
      "depth" variable which is 0 at the root and maximum at the leaves, 
      but the on-disk metadata uses a "depth" (actually eh_depth) which
      is opposite: maximum at the root, and 0 at the leaves.
      
      The ext4_ext_check_header() function is given a depth and checks
      the header against that depth; it expects the on-disk semantics,
      but we are giving it the opposite in the while loop in this 
      function.  We should be giving it the on-disk notion of "depth"
      which we can get from (p_depth - depth) - and if you look, the last
      (more commonly hit) call to ext4_ext_check_header() does just this.
      
      Sending in the wrong depth results in (incorrect) messages
      about corruption:
      
      EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
      in inode #2621457: unexpected eh_depth - magic f30a, entries 340,
      max 340(0), depth 1(2)
      
      http://bugzilla.kernel.org/show_bug.cgi?id=12821
      
      Reported-by: David Dindorp <ddi@dubex.dk>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
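      The two depth conventions in this commit message differ only by a
      subtraction, which the following sketch makes explicit.  Both function
      names are hypothetical, and the header check is reduced to a plain
      equality test for the illustration.

```c
#include <stdbool.h>

/* Convert a path-style depth (0 at the root, p_depth at the leaves)
 * into the on-disk convention (p_depth at the root, 0 at the leaves). */
static int sketch_ondisk_depth(int p_depth, int path_depth)
{
    return p_depth - path_depth;
}

/* Reduced header check: the stored eh_depth must equal the expected value. */
static bool sketch_header_depth_ok(int eh_depth, int expected)
{
    return eh_depth == expected;
}
```

      For a hypothetical tree with p_depth 3, the node at path depth 2
      stores eh_depth 1.  Passing the path depth (2) to the check fails even
      though the header is fine, whereas passing sketch_ondisk_depth(3, 2)
      succeeds.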
  27. 05 March, 2009 1 commit