1. 28 October 2010, 3 commits
    • ext4: fix potential infinite loop in ext4_da_writepages() · 0c9169cc
      Authored by Toshiyuki Okajima
      On linux-2.6.36-rc2, if we execute the following script, we can hang
      the system when the /bin/sync command is executed:
      
      ========================================================================
      #!/bin/sh
      
      echo -n "HANG UP TEST: "
      /bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
      /sbin/mkfs.ext4 -Fq /tmp/img
      /bin/mount -o loop -t ext4 /tmp/img /mnt
      /bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
      seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
      /bin/sync
      /bin/umount /mnt
      echo "DONE"
      exit 0
      ========================================================================
      
      We can see the following backtrace if we get the kdump when this
      hangup occurs:
      
      ======================================================================
      kthread()
      => bdi_writeback_thread()
         => wb_do_writeback()
            => wb_writeback()
               => writeback_inodes_wb()
                  => writeback_sb_inodes()
                     => writeback_single_inode()
                        => ext4_da_writepages()  ---+ 
                                      ^ infinite    |
                                      |   loop      |
                                      +-------------+
      ======================================================================
      
      The reason why this hangup happens is as follows:
      1) We write the last extent block of a file whose size is the filesystem 
         maximum size.
      2) The "BH_Delay" flag is set on the buffer_head of that block.
      3) - the member "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
         - the member "m_len" of struct mpage_da_data is 1
        mpage_put_bnr_to_bhs(), which is called via ext4_da_writepages(),
        cannot clear the "BH_Delay" flag of the buffer_head because m_lblk is
        of type ext4_lblk_t (an unsigned 32-bit value), so m_lblk + m_len
        overflows to 0.
      
        Therefore an infinite loop occurs: ext4_da_writepages() cannot write
        the page (which corresponds to that block) because the "BH_Delay"
        flag is never cleared.
      ----------------------------------------------------------------------
      static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
      				struct ext4_map_blocks *map)
      {
      ...
      	int blocks = map->m_len;
      ...
      		do {
      			// cur_logical = 4294967295
      			// map->m_lblk = 4294967295
      			// blocks = 1
      			// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
      			// (cur_logical >= map->m_lblk + blocks) => true
      			if (cur_logical >= map->m_lblk + blocks)
      				break;
      ----------------------------------------------------------------------
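      
      For illustration, here is a minimal, self-contained user-space C program
      (not the kernel fix itself) that reproduces the 32-bit wrap-around and
      shows one overflow-safe way to write the comparison:
      ----------------------------------------------------------------------
      #include <stdio.h>
      #include <stdint.h>
      
      typedef uint32_t ext4_lblk_t;	/* ext4 logical block numbers are 32-bit */
      
      int main(void)
      {
      	ext4_lblk_t cur_logical = 4294967295U;	/* last logical block */
      	ext4_lblk_t m_lblk = 4294967295U;
      	int blocks = 1;
      
      	/* m_lblk + blocks wraps to 0, so the "break" is taken and the
      	 * buffer_head is never processed */
      	if (cur_logical >= m_lblk + blocks)
      		printf("buggy check: breaks out, BH_Delay never cleared\n");
      
      	/* overflow-safe form: compare the distance, not the sum */
      	if (cur_logical - m_lblk >= (ext4_lblk_t)blocks)
      		printf("safe check: would also break here\n");
      	else
      		printf("safe check: block still gets processed\n");
      
      	return 0;
      }
      ----------------------------------------------------------------------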
      
      NOTE: Mounting with the nodelalloc option will avoid this code path
      and thus avoid this hang.
      Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: don't bump up LONG_MAX nr_to_write by a factor of 8 · b443e733
      Authored by Eric Sandeen
      I'm uneasy with lots of stuff going on in ext4_da_writepages(),
      but bumping nr_to_write from LONG_MAX to -8 clearly isn't
      making anything better, so avoid the multiplier in that case.
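      
      A hedged sketch of the kind of guard described (variable names and
      placement are approximate, not the exact diff):
      ----------------------------------------------------------------------
      	/* roughly, in ext4_da_writepages(): give the delalloc writeback
      	 * engine more headroom than the caller asked for, but never scale
      	 * an "unlimited" request, since LONG_MAX * 8 wraps around to -8 */
      	if (wbc->nr_to_write == LONG_MAX)
      		desired_nr_to_write = wbc->nr_to_write;
      	else
      		desired_nr_to_write = wbc->nr_to_write * 8;
      ----------------------------------------------------------------------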
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: stop looping in ext4_num_dirty_pages when max_pages reached · 659c6009
      Authored by Eric Sandeen
      Today we simply break out of the inner loop when we have accumulated
      max_pages; this keeps scanning forward and doing pagevec_lookup_tag()
      in the while (!done) loop, which is potentially a lot of work
      with no net effect.
      
      When we have accumulated max_pages, just clean up and return.
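      
      A rough sketch of the loop shape being described (simplified; not the
      exact body of ext4_num_dirty_pages()):
      ----------------------------------------------------------------------
      	while (!done) {
      		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
      					      PAGECACHE_TAG_DIRTY,
      					      (pgoff_t)PAGEVEC_SIZE);
      		if (nr_pages == 0)
      			break;
      		for (i = 0; i < nr_pages; i++) {
      			/* ... check the page is still dirty and contiguous ... */
      			num++;
      			if (num >= max_pages) {
      				/* previously only "break", which let the
      				 * outer while (!done) loop keep calling
      				 * pagevec_lookup_tag() for no benefit */
      				done = 1;
      				break;
      			}
      		}
      		pagevec_release(&pvec);
      	}
      	return num;
      ----------------------------------------------------------------------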
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  2. 10 August 2010, 4 commits
    • convert ext4 to ->evict_inode() · 0930fcc1
      Authored by Al Viro
      pretty much brute-force...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • remove inode_setattr · 1025774c
      Authored by Christoph Hellwig
      Replace inode_setattr with opencoded variants of it in all callers.  This
      moves the remaining call to vmtruncate into the filesystem methods where it
      can be replaced with the proper truncate sequence.
      
      In a few cases it was obvious that we would never end up calling vmtruncate
      so it was left out in the opencoded variant:
      
       spufs: explicitly checks for ATTR_SIZE earlier
       btrfs, hugetlbfs, logfs, dlmfs: explicitly clear ATTR_SIZE earlier
       ufs: contains an opencoded simple_setattr + truncate that sets the filesize just above
      
      In addition to that, ncpfs called inode_setattr with handcrafted iattrs,
      which allowed us to trim down the opencoded variant.
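      
      A hedged sketch of what an opencoded replacement looks like in a simple
      filesystem's ->setattr, using the 2010-era VFS helpers (foo_setattr is a
      hypothetical example, not code from this patch):
      ----------------------------------------------------------------------
      static int foo_setattr(struct dentry *dentry, struct iattr *attr)
      {
      	struct inode *inode = dentry->d_inode;
      	int error;
      
      	error = inode_change_ok(inode, attr);
      	if (error)
      		return error;
      
      	if ((attr->ia_valid & ATTR_SIZE) &&
      	    attr->ia_size != i_size_read(inode)) {
      		/* the truncate call that inode_setattr() used to hide */
      		error = vmtruncate(inode, attr->ia_size);
      		if (error)
      			return error;
      	}
      
      	/* copy the remaining attributes and mark the inode dirty, which is
      	 * all inode_setattr() did besides the truncate */
      	setattr_copy(inode, attr);
      	mark_inode_dirty(inode);
      	return 0;
      }
      ----------------------------------------------------------------------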
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • introduce __block_write_begin · 6e1db88d
      Authored by Christoph Hellwig
      Split up the block_write_begin implementation - __block_write_begin is a new
      trivial wrapper for block_prepare_write that always takes an already
      allocated page and can be called either from block_write_begin or from filesystem
      code that already has a page allocated.  Remove the handling of already
      allocated pages from block_write_begin after switching all callers that
      do it to __block_write_begin.
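      
      A hedged sketch of the resulting split (prototypes approximated from the
      description; the exact signatures in the tree may differ):
      ----------------------------------------------------------------------
      /* works on a page the caller already holds locked; just creates and
       * maps buffers and reads in whatever will not be overwritten */
      int __block_write_begin(struct page *page, loff_t pos, unsigned len,
      			get_block_t *get_block);
      
      /* the "full" helper now simply grabs the page and delegates */
      int block_write_begin(struct address_space *mapping, loff_t pos,
      		      unsigned len, unsigned flags, struct page **pagep,
      		      get_block_t *get_block)
      {
      	struct page *page;
      	int status;
      
      	page = grab_cache_page_write_begin(mapping,
      					   pos >> PAGE_CACHE_SHIFT, flags);
      	if (!page)
      		return -ENOMEM;
      
      	status = __block_write_begin(page, pos, len, get_block);
      	if (unlikely(status)) {
      		unlock_page(page);
      		page_cache_release(page);
      		page = NULL;
      	}
      
      	*pagep = page;
      	return status;
      }
      ----------------------------------------------------------------------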
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • sort out blockdev_direct_IO variants · eafdc7d1
      Authored by Christoph Hellwig
      Move the call to vmtruncate that gets rid of excess blocks into the
      callers, in preparation for the new truncate calling sequence.  This was
      only done for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc
      variant was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it, since just opencoding the two additional
      parameters is shorter than the name suffix.
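      
      A hedged sketch of the caller-side pattern the truncate moves into
      (illustrative; foo_get_block is a placeholder, and this is not the exact
      diff applied to any one filesystem):
      ----------------------------------------------------------------------
      	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
      				 iov, offset, nr_segs, foo_get_block, NULL);
      	/* a failed or short DIO write can leave blocks instantiated beyond
      	 * i_size; trimming them is now the caller's job instead of being
      	 * buried in the _newtrunc/no_locking helper variants */
      	if (unlikely((rw & WRITE) && ret < 0)) {
      		loff_t isize = i_size_read(inode);
      		loff_t end = offset + iov_length(iov, nr_segs);
      
      		if (end > isize)
      			vmtruncate(inode, isize);
      	}
      ----------------------------------------------------------------------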
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 06 August 2010, 1 commit
    • ext4: Fix dirtying of journalled buffers in data=journal mode · 56d35a4c
      Authored by Jan Kara
      In data=journal mode, we still use block_write_begin() to prepare
      a page for writing. This function can occasionally mark a buffer dirty,
      which violates journalling assumptions: when a buffer is part of
      a transaction, it should not be dirty, and a buffer can already be part
      of the forget list of some transaction when block_write_begin()
      gets called. This violation of journalling assumptions then results
      in "JBD: Spotted dirty metadata buffer..." warnings.
      
      In fact, temporarily dirtying the buffer while the page is still locked
      does not really cause problems for the journalling because we won't write
      the buffer out until the page gets unlocked. So we just have to make sure
      to clear the dirty bits before unlocking the page.
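      
      A hedged sketch of the kind of fix described: clear the stray dirty bit
      while the page is still locked and let the journal handle redirty the
      buffer properly (approximate; helper names as in ext4 of that era):
      ----------------------------------------------------------------------
      static int do_journal_get_write_access(handle_t *handle,
      				       struct buffer_head *bh)
      {
      	int dirty = buffer_dirty(bh);
      	int ret;
      
      	if (!buffer_mapped(bh) || buffer_freed(bh))
      		return 0;
      	/*
      	 * block_write_begin() may have dirtied this buffer.  Clear the
      	 * dirty bit here so jbd2 does not complain about dirty metadata;
      	 * the data reaches the journal through the handle instead.
      	 */
      	if (dirty)
      		clear_buffer_dirty(bh);
      	ret = ext4_journal_get_write_access(handle, bh);
      	if (!ret && dirty)
      		ret = ext4_handle_dirty_metadata(handle, NULL, bh);
      	return ret;
      }
      ----------------------------------------------------------------------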
      Signed-off-by: Jan Kara <jack@suse.cz>
  4. 04 August 2010, 1 commit
  5. 30 July 2010, 1 commit
  6. 27 July 2010, 9 commits
  7. 30 June 2010, 1 commit
  8. 15 June 2010, 2 commits
  9. 14 June 2010, 1 commit
  10. 05 June 2010, 1 commit
  11. 22 May 2010, 1 commit
  12. 17 May 2010, 8 commits
  13. 16 May 2010, 3 commits
    • ext4: don't use quota reservation for speculative metadata · 72b8ab9d
      Authored by Eric Sandeen
      Because we can badly over-reserve metadata when we
      calculate worst-case, it complicates things for quota, since
      we must reserve and then claim later, retry on EDQUOT, etc.
      Quota is also a generally smaller pool than fs free blocks,
      so this over-reservation hurts more, and more often.
      
      I'm of the opinion that it's not the worst thing to allow
      metadata to push a user slightly over quota.  This simplifies
      the code and avoids the false quota rejections that result
      from worst-case speculation.
      
      This patch stops the speculative quota-charging for
      worst-case metadata requirements, and just charges quota
      when the blocks are allocated at writeout time.  It also removes
      the try-again loop on EDQUOT.
      
      This patch has been tested indirectly by running the xfstests
      suite with a hack to mount & enable quota prior to the test.
      
      I also did a more specific test of fragmenting freespace
      and then doing a large delalloc write under quota; quota
      stopped me at the right amount of file IO, and then the
      writeout generated enough metadata (due to the fragmentation)
      that it put me slightly over quota, as expected.
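      
      A hedged before/after sketch of the accounting change, using the generic
      dquot_* helpers (simplified pseudocode; error handling and block-count
      bookkeeping elided):
      ----------------------------------------------------------------------
      	/* before: reserve quota up front for the data blocks plus a
      	 * worst-case guess at metadata, claim at writeout, retry on EDQUOT */
      	err = dquot_reserve_block(inode, data_blocks + worst_case_meta);
      	...
      	dquot_claim_block(inode, blocks_actually_used);
      
      	/* after: reserve quota only for the data blocks; metadata is
      	 * charged directly when those blocks are really allocated */
      	err = dquot_reserve_block(inode, data_blocks);
      	...
      	dquot_alloc_block(inode, meta_blocks_allocated);
      ----------------------------------------------------------------------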
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: don't scan/accumulate more pages than mballoc will allocate · c445e3e0
      Authored by Eric Sandeen
      There was a bug reported on RHEL5 that a 10G dd on a 12G box
      had a very, very slow sync after that.
      
      At issue was the loop in write_cache_pages scanning all the way
      to the end of the 10G file, even though the subsequent call
      to mpage_da_submit_io would only actually write a smallish amount; then
      we went back to the write_cache_pages loop ... wasting tons of time
      in calling __mpage_da_writepage for thousands of pages we would
      just revisit (many times) later.
      
      Upstream it's not such a big issue for sys_sync because we get
      to the loop with a much smaller nr_to_write, which limits the loop.
      
      However, talking with Aneesh he realized that fsync upstream still
      gets here with a very large nr_to_write and we face the same problem.
      
      This patch makes mpage_add_bh_to_extent stop the loop after we've
      accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
      causes the write_cache_pages loop to break.
      
      Repeating the test with a dirty_ratio of 80 (to leave something for
      fsync to do), I don't see huge IO performance gains, but the reduction
      in cpu usage is striking: 80% usage with stock, and 2% with the
      below patch.  Instrumenting the loop in write_cache_pages clearly
      shows that we are wasting time here.
      
      Eventually we need to change mpage_da_map_pages() to also submit its I/O
      to the block layer, subsuming mpage_da_submit_io(), and then change it
      to call ext4_get_blocks() multiple times.
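      
      A hedged sketch of the check described (the constant and its exact
      placement in mpage_add_bh_to_extent() are illustrative):
      ----------------------------------------------------------------------
      	/* don't accumulate more than mballoc can allocate in one request
      	 * (8 MB worth of blocks, i.e. 2048 4k pages); flag io_done so
      	 * __mpage_da_writepage returns and write_cache_pages stops */
      	if (nrblocks >= (8*1024*1024 >> inode->i_blkbits)) {
      		mpd->io_done = 1;
      		return;
      	}
      ----------------------------------------------------------------------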
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix quota accounting in case of fallocate · 35121c98
      Authored by Dmitry Monakhov
      allocated_meta_data is already included in the 'used' variable.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  14. 04 April 2010, 2 commits
    • ext4: Fix buffer head leaks after calls to ext4_get_inode_loc() · fd2dd9fb
      Authored by Curt Wohlgemuth
      Calls to ext4_get_inode_loc() return with a reference to a buffer
      head in iloc->bh.  The callers of this function in ext4_write_inode()
      (when in no-journal mode) and in ext4_xattr_fiemap() don't release the
      buffer head after using it.
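      
      A hedged sketch of the leak and the one-line fix (illustrative caller;
      error handling trimmed):
      ----------------------------------------------------------------------
      	struct ext4_iloc iloc;
      	int err;
      
      	err = ext4_get_inode_loc(inode, &iloc);	/* takes a ref on iloc.bh */
      	if (err)
      		return err;
      
      	/* ... read or write inode fields through iloc.bh ... */
      
      	brelse(iloc.bh);	/* the release that was missing in the
      				 * no-journal ext4_write_inode() path and
      				 * in ext4_xattr_fiemap() */
      ----------------------------------------------------------------------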
      
      Addresses-Google-Bug: #2548165
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Fix possible lost inode write in no journal mode · 8b472d73
      Authored by Curt Wohlgemuth
      In the no-journal case, ext4_write_inode() will fetch the bh and call
      sync_dirty_buffer() on it.  However, if the bh has already been
      written out and reclaimed for some other purpose, AND if the inode
      is the only in-use inode in its inode table block, then
      ext4_get_inode_loc() will not read the inode table block from disk,
      but as an optimization will fill the block with zeros, assuming that its
      caller will copy in the on-disk version of the inode.  This is not
      done by ext4_write_inode(), so the contents of the inode can simply
      get lost.  The fix is to use __ext4_get_inode_loc() with in_mem set to
      0, instead of ext4_get_inode_loc().  Long term, the API needs to be
      fixed so it's obvious why the latter is not safe.
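      
      A hedged sketch of the change in the no-journal ext4_write_inode() path
      (approximate):
      ----------------------------------------------------------------------
      	struct ext4_iloc iloc;
      	int err;
      
      	/* in_mem == 0 forces a real read of the inode table block; the
      	 * ext4_get_inode_loc() fast path may hand back a zero-filled block
      	 * on the assumption that the caller fills in the on-disk inode,
      	 * which ext4_write_inode() does not do */
      	err = __ext4_get_inode_loc(inode, &iloc, 0);
      	if (err)
      		return err;
      	/* ... copy the in-core inode into iloc.bh, sync_dirty_buffer() ... */
      	brelse(iloc.bh);
      ----------------------------------------------------------------------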
      
      Addresses-Google-Bug: #2526446
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  15. 30 March 2010, 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Authored by Tejun Heo
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming availability.  As this
      conversion needs to touch a large number of source files, the following
      script is used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there, i.e. if only gfp is used,
        gfp.h; if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include so that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
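      
      For illustration, the kind of edit the script produces on a hypothetical
      .c file:
      ----------------------------------------------------------------------
      /* before: this file uses kmalloc()/kfree() but never includes slab.h;
       * it only builds because percpu.h (pulled in via module.h/sched.h)
       * drags slab.h in implicitly */
      #include <linux/module.h>
      
      /* after the sweep: the dependency is stated explicitly */
      #include <linux/module.h>
      #include <linux/slab.h>		/* kmalloc(), kfree() */
      
      /* a file that only needs GFP_* flags would instead get gfp.h */
      #include <linux/gfp.h>
      ----------------------------------------------------------------------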
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition, and for others adding it to an
         implementation .h or embedding .c file was more appropriate.  This
         step added inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  16. 15 March 2010, 1 commit