1. 28 10月, 2010 22 次提交
    • T
      ext4: simplify ext4_writepage() · a42afc5f
      Theodore Ts'o 提交于
      The actual code in ext4_writepage() is unnecessarily convoluted.
      Simplify it so it is easier to understand, but otherwise logically
      equivalent.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a42afc5f
    • T
      ext4: call mpage_da_submit_io() from mpage_da_map_blocks() · 5a87b7a5
      Theodore Ts'o 提交于
      Eventually we need to completely reorganize the ext4 writepage
      callpath, but for now, we simplify things a little by calling
      mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
      places where we call mpage_da_map_blocks() it is followed up by a call
      to mpage_da_submit_io().
      
      We're also a wee bit better with respect to error handling, but there
      are still a number of issues where it's not clear what the right thing
      is to do with ext4 functions deep in the writeback codepath fails.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      5a87b7a5
    • T
      ext4: use KMEM_CACHE instead of kmem_cache_create · 16828088
      Theodore Ts'o 提交于
      Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
      cache.  This slab tends to be fairly static, so it shouldn't be marked
      as likely to have free pages that can be reclaimed.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      
      16828088
    • T
      ext4: use search_dirblock() in ext4_dx_find_entry() · 7845c049
      Theodore Ts'o 提交于
      Use the search_dirblock() in ext4_dx_find_entry().  It makes the code
      easier to read, and it takes advantage of common code.  It also saves
      100 bytes or so of text space.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Brad Spengler <spender@grsecurity.net>
      7845c049
    • T
      ext4: avoid uninitialized memory references in ext3_htree_next_block() · 8941ec8b
      Theodore Ts'o 提交于
      If the first block of htree directory is missing '.' or '..' but is
      otherwise a valid directory, and we do a lookup for '.' or '..', it's
      possible to dereference an uninitialized memory pointer in
      ext4_htree_next_block().
      
      We avoid this by moving the special case from ext4_dx_find_entry() to
      ext4_find_entry(); this also means we can optimize ext4_find_entry()
      slightly when NFS looks up "..".
      
      Thanks to Brad Spengler for pointing a Clang warning that led me to
      look more closely at this code.  The warning was harmless, but it was
      useful in pointing out code that was too ugly to live.  This warning was
      also reported by Roman Borisov.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Brad Spengler <spender@grsecurity.net>
      8941ec8b
    • E
      ext4: remove unused ext4_sb_info members · 640e9396
      Eric Sandeen 提交于
      Not that these take up a lot of room, but the structure is long enough
      as it is, and there's no need to confuse people with these various
      undocumented & unused structure members...
      Signed-off-by: NEric Sandeen <sandeen@redaht.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      640e9396
    • E
      ext4: queue conversion after adding to inode's completed IO list · c999af2b
      Eric Sandeen 提交于
      By queuing the io end on the unwritten workqueue before adding it
      to our inode's list of completed IOs, I think we run the risk
      of the work getting completed, and the IO freed, before we try
      to add it to the inode's i_completed_io_list.
      
      It should be safe to add it to the inode's list of completed
      IOs, and -then- queue it for completion, I think.
      
      Thanks to Dave Chinner for pointing out the race.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NJiaying Zhang <jiayingz@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      c999af2b
    • E
      ext4: don't use ext4_allocation_contexts for tracing · 3e1e5f50
      Eric Sandeen 提交于
      Many tracepoints were populating an ext4_allocation_context
      to pass in, but this requires a slab allocation even when
      tracepoints are off.  In fact, 4 of 5 of these allocations
      were only for tracing.  In addition, we were only using a
      small fraction of the 144 bytes of this structure for this
      purpose.
      
      We can do away with all these alloc/frees of the ac and
      simply pass in the bits we care about, instead.
      
      I tested this by turning on tracing and running through
      xfstests on x86_64.  I did not actually do anything with
      the trace output, however.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      3e1e5f50
    • T
      ext4: fix potential infinite loop in ext4_da_writepages() · 0c9169cc
      Toshiyuki Okajima 提交于
      On linux-2.6.36-rc2, if we execute the following script, we can hang
      the system when the /bin/sync command is executed:
      
      ========================================================================
      #!/bin/sh
      
      echo -n "HANG UP TEST: "
      /bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
      /sbin/mkfs.ext4 -Fq /tmp/img
      /bin/mount -o loop -t ext4 /tmp/img /mnt
      /bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
      seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
      /bin/sync
      /bin/umount /mnt
      echo "DONE"
      exit 0
      ========================================================================
      
      We can see the following backtrace if we get the kdump when this
      hangup occurs:
      
      ======================================================================
      kthread()
      => bdi_writeback_thread()
         => wb_do_writeback()
            => wb_writeback()
               => writeback_inodes_wb()
                  => writeback_sb_inodes()
                     => writeback_single_inode()
                        => ext4_da_writepages()  ---+ 
                                      ^ infinite    |
                                      |   loop      |
                                      +-------------+
      ======================================================================
      
      The reason why this hangup happens is described as follows:
      1) We write the last extent block of the file whose size is the filesystem 
         maximum size.
      2) "BH_Delay" flag is set on the buffer_head of its block.
      3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
         - the member, "m_len" of struct mpage_da_data is 1
        mpage_put_bnr_to_bhs() which is called via ext4_da_writepages()
        cannot clear "BH_Delay" flag of the buffer_head because the type of
        m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow.
      
        Therefore an infinite loop occurs because ext4_da_writepages()
        cannot write the page (which corresponds to the block) since
        "BH_Delay" flag isn't cleared.
      ----------------------------------------------------------------------
      static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
      				struct ext4_map_blocks *map)
      {
      ...
      	int blocks = map->m_len;
      ...
      		do {
      			// cur_logical = 4294967295
      			// map->m_lblk = 4294967295
      			// blocks = 1
      			// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
      			// (cur_logical >= map->m_lblk + blocks) => true
      			if (cur_logical >= map->m_lblk + blocks)
      				break;
      ----------------------------------------------------------------------
      
      NOTE: Mounting with the nodelalloc option will avoid this codepath,
      and thus, avoid this hang
      Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0c9169cc
    • T
      ext4: improve llseek error handling for overly large seek offsets · e0d10bfa
      Toshiyuki Okajima 提交于
      The llseek system call should return EINVAL if passed a seek offset
      which results in a write error.  What this maximum offset should be
      depends on whether or not the huge_file file system feature is set,
      and whether or not the file is extent based or not.
      
      
      If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be 
      written (write systemcall) is different from the maximum size which can be 
      sought (lseek systemcall).
      
      For example, the following 2 cases demonstrates the differences
      between the maximum size which can be written, versus the seek offset
      allowed by the llseek system call:
      
      #1: mkfs.ext3 <dev>; mount -t ext4 <dev>
      #2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>
      
      Table. the max file size which we can write or seek
             at each filesystem feature tuning and file flag setting
      +============+===============================+===============================+
      | \ File flag|                               |                               |
      |      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTETNS_FL        |
      |case       \|                               |                               |
      +------------+-------------------------------+-------------------------------+
      | #1         |   write:      2194719883264   | write:       --------------   |
      |            |   seek:       2199023251456   | seek:        --------------   |
      +------------+-------------------------------+-------------------------------+
      | #2         |   write:      4402345721856   | write:       17592186044415   |
      |            |   seek:      17592186044415   | seek:        17592186044415   |
      +------------+-------------------------------+-------------------------------+
      
      The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
      (= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped 
      maxbytes).  Although generic_file_llseek uses only extent-mapped maxbytes.
      (llseek of ext4_file_operations is generic_file_llseek which uses
      sb->s_maxbytes.)
      
      Therefore we create ext4 llseek function which uses 2 maxbytes.
      
      The new own function originates from generic_file_llseek().
      If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters 
      inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.
      Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      e0d10bfa
    • M
      ext4: don't update sb journal_devnum when RO dev · c41303ce
      Maciej Żenczykowski 提交于
      An ext4 filesystem on a read-only device, with an external journal
      which is at a different device number then recorded in the superblock
      will fail to honor the read-only setting of the device and trigger
      a superblock update (write).
      
      For example:
        - ext4 on a software raid which is in read-only mode
        - external journal on a read-write device which has changed device num
        - attempt to mount with -o journal_dev=<new_number>
        - hits BUG_ON(mddev->ro = 1) in md.c
      
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NMaciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      c41303ce
    • L
      ext4: use sb_issue_zeroout in ext4_ext_zeroout · 2407518d
      Lukas Czerner 提交于
      Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
      own approach to zero out extents.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      2407518d
    • L
      ext4: use sb_issue_zeroout in setup_new_group_blocks · a31437b8
      Lukas Czerner 提交于
      Use sb_issue_zeroout to zero out inode table and descriptor table
      blocks instead of old approach which involves journaling.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a31437b8
    • L
      ext4: add interface to advertise ext4 features in sysfs · 857ac889
      Lukas Czerner 提交于
      User-space should have the opportunity to check what features doest ext4
      support in each particular copy. This adds easy interface by creating new
      "features" directory in sys/fs/ext4/. In that directory files
      advertising feature names can be created.
      
      Add lazy_itable_init to the feature list.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      857ac889
    • L
      ext4: add support for lazy inode table initialization · bfff6873
      Lukas Czerner 提交于
      When the lazy_itable_init extended option is passed to mke2fs, it
      considerably speeds up filesystem creation because inode tables are
      not zeroed out.  The fact that parts of the inode table are
      uninitialized is not a problem so long as the block group descriptors,
      which contain information regarding how much of the inode table has
      been initialized, has not been corrupted However, if the block group
      checksums are not valid, e2fsck must scan the entire inode table, and
      the the old, uninitialized data could potentially cause e2fsck to
      report false problems.
      
      Hence, it is important for the inode tables to be initialized as soon
      as possble.  This commit adds this feature so that mke2fs can safely
      use the lazy inode table initialization feature to speed up formatting
      file systems.
      
      This is done via a new new kernel thread called ext4lazyinit, which is
      created on demand and destroyed, when it is no longer needed.  There
      is only one thread for all ext4 filesystems in the system. When the
      first filesystem with inititable mount option is mounted, ext4lazyinit
      thread is created, then the filesystem can register its request in the
      request list.
      
      This thread then walks through the list of requests picking up
      scheduled requests and invoking ext4_init_inode_table(). Next schedule
      time for the request is computed by multiplying the time it took to
      zero out last inode table with wait multiplier, which can be set with
      the (init_itable=n) mount option (default is 10).  We are doing
      this so we do not take the whole I/O bandwidth. When the thread is no
      longer necessary (request list is empty) it frees the appropriate
      structures and exits (and can be created later later by another
      filesystem).
      
      We do not disturb regular inode allocations in any way, it just do not
      care whether the inode table is, or is not zeroed. But when zeroing, we
      have to skip used inodes, obviously. Also we should prevent new inode
      allocations from the group, while zeroing is on the way. For that we
      take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
      in the ext4_claim_inode, so when we are unlucky and allocator hits the
      group which is currently being zeroed, it just has to wait.
      
      This can be suppresed using the mount option no_init_itable.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bfff6873
    • S
      ext4: fix NULL pointer dereference in print_daily_error_info() · a1c6c569
      Sergey Senozhatsky 提交于
      Fix NULL pointer dereference in print_daily_error_info, when   
      called on unmounted fs (EXT4_SB(sb) returns NULL), by removing error 
      reporting timer in ext4_put_super.
      
      Google-Bug-Id: 3017663
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a1c6c569
    • L
      ext4: don't hold spinlock while calling ext4_issue_discard() · 53fdcf99
      Lukas Czerner 提交于
      We can't hold the block group spinlock because we ext4_issue_discard()
      calls wait and hence can get rescheduled.
      
      Google-Bug-Id: 3017678
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      53fdcf99
    • L
      ext4: check for negative error code from sb_issue_discard · 58298709
      Lukas Czerner 提交于
      sb_issue_discard() is returning negative error code, so check for
      -EOPNOTSUPP.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      58298709
    • E
      ext4: don't bump up LONG_MAX nr_to_write by a factor of 8 · b443e733
      Eric Sandeen 提交于
      I'm uneasy with lots of stuff going on in ext4_da_writepages(),
      but bumping nr_to_write from LLONG_MAX to -8 clearly isn't
      making anything better, so avoid the multiplier in that case.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      b443e733
    • E
      ext4: stop looping in ext4_num_dirty_pages when max_pages reached · 659c6009
      Eric Sandeen 提交于
      Today we simply break out of the inner loop when we have accumulated
      max_pages; this keeps scanning forwad and doing pagevec_lookup_tag()
      in the while (!done) loop, this does potentially a lot of work
      with no net effect.
      
      When we have accumulated max_pages, just clean up and return.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      659c6009
    • C
      ext4: use dedicated slab caches for group_info structures · fb1813f4
      Curt Wohlgemuth 提交于
      ext4_group_info structures are currently allocated with kmalloc().
      With a typical 4K block size, these are 136 bytes each -- meaning
      they'll each consume a 256-byte slab object.  On a system with many
      ext4 large partitions, that's a lot of wasted kernel slab space.
      (E.g., a single 1TB partition will have about 8000 block groups, using
      about 2MB of slab, of which nearly 1MB is wasted.)
      
      This patch creates an array of slab pointers created as needed --
      depending on the superblock block size -- and uses these slabs to
      allocate the group info objects.
      
      Google-Bug-Id: 2980809
      Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      fb1813f4
    • T
      ext4: fix EOFBLOCKS_FL handling · 58590b06
      Theodore Ts'o 提交于
      It turns out we have several problems with how EOFBLOCKS_FL is
      handled.  First of all, there was a fencepost error where we were not
      clearing the EOFBLOCKS_FL when fill in the last uninitialized block,
      but rather when we allocate the next block _after_ the uninitalized
      block.  Secondly we were not testing to see if we needed to clear the
      EOFBLOCKS_FL when writing to the file O_DIRECT or when were converting
      an uninitialized block (which is the most common case).
      
      Google-Bug-Id: 2928259
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      58590b06
  2. 10 8月, 2010 5 次提交
    • A
      mbcache: Remove unused features · 2aec7c52
      Andreas Gruenbacher 提交于
      The mbcache code was written to support a variable number of indexes,
      but all the existing users use exactly one index.  Simplify to code to
      support only that case.
      
      There are also no users of the cache entry free operation, and none of
      the users keep extra data in cache entries.  Remove those features as
      well.
      Signed-off-by: NAndreas Gruenbacher <agruen@suse.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2aec7c52
    • A
      convert ext4 to ->evict_inode() · 0930fcc1
      Al Viro 提交于
      pretty much brute-force...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0930fcc1
    • C
      remove inode_setattr · 1025774c
      Christoph Hellwig 提交于
      Replace inode_setattr with opencoded variants of it in all callers.  This
      moves the remaining call to vmtruncate into the filesystem methods where it
      can be replaced with the proper truncate sequence.
      
      In a few cases it was obvious that we would never end up calling vmtruncate
      so it was left out in the opencoded variant:
      
       spufs: explicitly checks for ATTR_SIZE earlier
       btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
       ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above
      
      In addition to that ncpfs called inode_setattr with handcrafted iattrs,
      which allowed to trim down the opencoded variant.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1025774c
    • C
      introduce __block_write_begin · 6e1db88d
      Christoph Hellwig 提交于
      Split up the block_write_begin implementation - __block_write_begin is a new
      trivial wrapper for block_prepare_write that always takes an already
      allocated page and can be either called from block_write_begin or filesystem
      code that already has a page allocated.  Remove the handling of already
      allocated pages from block_write_begin after switching all callers that
      do it to __block_write_begin.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6e1db88d
    • C
      sort out blockdev_direct_IO variants · eafdc7d1
      Christoph Hellwig 提交于
      Move the call to vmtruncate to get rid of accessive blocks to the callers
      in prepearation of the new truncate calling sequence.  This was only done
      for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
      was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it as just opencoding the two additional
      paramters is shorted than the name suffix.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eafdc7d1
  3. 06 8月, 2010 2 次提交
    • A
      ext4: Adding error check after calling ext4_mb_regular_allocator() · 6c7a120a
      Aditya Kali 提交于
      If the bitmap block on disk is bad, ext4_mb_load_buddy() returns an
      error. This error is returned to the caller,
      ext4_mb_regular_allocator() and then to ext4_mb_new_blocks().  But
      ext4_mb_new_blocks() did not check for the return value of
      ext4_mb_regular_allocator() and would repeatedly try to load the
      bitmap block. The fix simply catches the return value and exits out of
      the 'repeat' loop after cleanup.
      
      We also take the opportunity to clean up the error handling in
      ext4_mb_new_blocks().
      
      Google-Bug-Id: 2853530
      Signed-off-by: NAditya Kali <adityakali@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      6c7a120a
    • J
      ext4: Fix dirtying of journalled buffers in data=journal mode · 56d35a4c
      Jan Kara 提交于
      In data=journal mode, we still use block_write_begin() to prepare
      page for writing. This function can occasionally mark buffer dirty
      which violates journalling assumptions - when a buffer is part of
      a transaction, it should be dirty and a buffer can be already part
      of a forget list of some transaction when block_write_begin()
      gets called. This violation of journalling assumptions then results
      in "JBD: Spotted dirty metadata buffer..." warnings.
      
      In fact, temporary dirtying the buffer while the page is still locked
      does not really cause problems to the journalling because we won't write
      the buffer until the page gets unlocked. So we just have to make sure
      to clear dirty bits before unlocking the page.
      Signed-off-by: NJan Kara <jack@suse.cz>
      56d35a4c
  4. 05 8月, 2010 1 次提交
    • E
      ext4: re-inline ext4_rec_len_(to|from)_disk functions · 0cfc9255
      Eric Sandeen 提交于
      commit 3d0518f4, "ext4: New rec_len encoding for very
      large blocksizes" made several changes to this path, but from
      a perf perspective, un-inlining ext4_rec_len_from_disk() seems
      most significant.  This function is called from ext4_check_dir_entry(),
      which on a file-creation workload is called extremely often.
      
      I tested this with bonnie:
      
      # bonnie++ -u root -s 0 -f -x 200 -d /mnt/test -n 32
      
      (this does 200 iterations) and got this for the file creations:
      
      ext4 stock:   Average =  21206.8 files/s
      ext4 inlined: Average =  22346.7 files/s  (+5%)
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0cfc9255
  5. 04 8月, 2010 2 次提交
  6. 02 8月, 2010 3 次提交
  7. 30 7月, 2010 1 次提交
  8. 27 7月, 2010 4 次提交