1. Jul 12, 2008 (6 commits)
    • ext4: delayed allocation i_blocks fix for stat · 3e3398a0
      Committed by Mingming Cao
      Right now i_blocks is not updated until the blocks are actually
      allocated on disk.  This means that with delayed allocation, right
      after files are copied, "ls -sF" shows the files as taking 0 blocks
      on disk.  "du" also shows the files taking zero space, which is
      highly confusing to the user.
      
      Since delayed allocation already keeps track of the per-inode total
      number of blocks that are subject to delayed allocation, this patch
      fixes this by using that count to adjust the value returned by
      stat(2).  When the real block allocation is done, i_blocks is
      updated; since the blocks reserved for delayed allocation are
      decreased at the same time, the value returned by stat(2) stays
      consistent.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      3e3398a0
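The accounting idea in this commit can be sketched in a few lines. This is a hypothetical userspace model, not the actual ext4 code: `toy_inode`, `stat_blocks()` and `allocate()` are illustrative names. stat(2) reports the on-disk sector count plus the inode's reserved delalloc blocks converted to 512-byte sectors, so the reported size is stable across the actual allocation.

```c
#include <assert.h>

/* Illustrative sketch of the stat(2) adjustment; all names are
 * assumptions, not the real ext4 identifiers. */
struct toy_inode {
    unsigned long long i_blocks;        /* 512-byte sectors on disk */
    unsigned long long reserved_blocks; /* fs blocks awaiting allocation */
    unsigned int blkbits;               /* log2 of fs block size */
};

/* Blocks reported by stat(2): on-disk sectors plus reserved fs blocks
 * converted to 512-byte sectors. */
unsigned long long stat_blocks(const struct toy_inode *inode)
{
    return inode->i_blocks +
           (inode->reserved_blocks << (inode->blkbits - 9));
}

/* When the delayed blocks are finally allocated, i_blocks grows and the
 * reservation shrinks by the same amount, so stat_blocks() is stable. */
void allocate(struct toy_inode *inode, unsigned long long nblocks)
{
    inode->reserved_blocks -= nblocks;
    inode->i_blocks += nblocks << (inode->blkbits - 9);
}
```

The invariant the patch relies on is visible here: allocation moves blocks from the reserved counter into i_blocks, leaving their sum (and hence the stat(2) result) unchanged.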
    • ext4: fix delalloc i_disksize early update issue · 632eaeab
      Committed by Mingming Cao
      Ext4_da_write_end() used walk_page_buffers() with a callback function
      of ext4_bh_unmapped_or_delay() to check whether it extended the file
      size without allocating any blocks (since in this case i_disksize
      needs to be updated).  However, this didn't work properly, because
      the buffer head has not been marked dirty yet --- that is done later
      in block_commit_write() --- which caused
      ext4_bh_unmapped_or_delay() to always return false.
      
      In addition, walk_page_buffers() checks all of the buffer heads
      covering the page, but the only buffer_head that should be checked is
      the one covering the end of the write.  Otherwise, given a 1k
      blocksize filesystem and a 4k page size, the buffer head covering the
      first 1k stripe of the file could be unmapped (because the file was
      sparse), while the second or third buffer_head covering that page
      could be mapped; using walk_page_buffers() would fail in this case,
      since it would stop at the first unmapped buffer_head and return
      true.
      
      The core problem is that walk_page_buffers() was intended to do work
      in a callback function, where a non-zero return value indicates a
      failure and terminates the walk of the buffer heads covering the
      page.  It was not intended to be used with a boolean predicate, such
      as ext4_bh_unmapped_or_delay().
      
      Also add an additional fix from Aneesh to protect the i_disksize
      update against a race with truncate.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      632eaeab
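The "error-style walker vs. boolean predicate" mismatch can be demonstrated with a toy model. This is not the real ext4/jbd2 code; `walk()` merely mimics walk_page_buffers()'s early-termination behaviour, and the buffer layout below is the sparse-file scenario from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Toy buffer head: only the two flags this example needs. */
struct toy_bh { int mapped; int delay; };

typedef int (*bh_fn)(const struct toy_bh *);

/* Mimics walk_page_buffers(): stops at the FIRST buffer whose callback
 * returns non-zero and reports that value, as if it were an error. */
int walk(const struct toy_bh *bhs, size_t n, bh_fn fn)
{
    for (size_t i = 0; i < n; i++) {
        int ret = fn(&bhs[i]);
        if (ret)
            return ret;  /* early termination on "failure" */
    }
    return 0;
}

/* Boolean predicate in the style of ext4_bh_unmapped_or_delay(). */
int unmapped_or_delay(const struct toy_bh *bh)
{
    return !bh->mapped || bh->delay;
}
```

With a 1k-blocksize page of four buffers whose first buffer is a sparse hole, the walker answers "true" for the whole page even though the one buffer that matters (the one covering the end of the write) is mapped; checking that single buffer directly gives the opposite answer.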
    • ext4: Handle page without buffers in ext4_*_writepage() · f0e6c985
      Committed by Aneesh Kumar K.V
      Buffers can be removed from a page before it gets marked dirty and
      passed to writepage().  In writepage() we now just initialize the
      buffers and check whether they are mapped and non-delay: if they are,
      we write the page; otherwise we mark them dirty.  With this change we
      no longer do block allocation at all in ext4_*_writepage().
      
      writepage() can get called under many conditions, and with a locking
      order of journal_start -> lock_page we must not try to allocate
      blocks in writepage(), which is called after taking the page lock.
      writepage() can get called via shrink_page_list() even with a journal
      handle that was created for doing an inode update.  For example,
      ext4_da_write_begin() creates a journal handle with one credit,
      expecting an i_disksize update for the inode; but
      ext4_da_write_begin() can trigger shrink_page_list() via
      _grab_page_cache.  So having a valid handle via
      ext4_journal_current_handle() is no guarantee that we can use that
      handle for block allocation in writepage(), since we should not be
      using credits that were reserved for other updates; doing so could
      leave us out of credits when we update the inode.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      f0e6c985
    • ext4: Add ordered mode support for delalloc · cd1aac32
      Committed by Aneesh Kumar K.V
      This provides a new ordered mode implementation which gets rid of
      using buffer heads to enforce the ordering between a metadata change
      and the related data change.  Instead, the new ordered mode keeps
      track of all of the inodes touched by each transaction on a list,
      and when that transaction is committed, it flushes all of the dirty
      pages for those inodes.  In addition, the new ordered mode reverses
      the lock ordering of the page lock and the transaction lock, which
      makes delayed allocation easier to support.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      cd1aac32
    • ext4: Invert lock ordering of page_lock and transaction start in delalloc · 61628a3f
      Committed by Mingming Cao
      With the reversed locking, we need to start a transaction before
      taking the page lock, so in ext4_da_writepages() we break the
      write-out into chunks and restart the journal for each chunk, to
      ensure that each chunk's write-out fits in a single transaction.
      
      Updated patch from Aneesh Kumar K.V
      <aneesh.kumar@linux.vnet.ibm.com>, which fixes a delalloc sync hang
      with journal lock inversion and addresses the performance regression
      issue.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      61628a3f
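The chunking described above can be sketched as a simple loop. This is a userspace toy model under stated assumptions: the name `writepages_chunked`, the pretend credit limit, and the one-credit-per-page rule are all illustrative, not the real ext4 credit formula.

```c
#include <assert.h>

enum { MAX_CREDITS_PER_TXN = 8 };  /* pretend journal limit */

/* Toy rule: assume one journal credit per page written. */
static int pages_per_chunk(void) { return MAX_CREDITS_PER_TXN; }

/* Write out `total` dirty pages in credit-limited chunks, restarting
 * the "journal" for each chunk; returns how many transactions were
 * needed and stores the number of pages written via *written_out. */
int writepages_chunked(int total, int *written_out)
{
    int txns = 0, written = 0;
    while (written < total) {
        int chunk = total - written;
        if (chunk > pages_per_chunk())
            chunk = pages_per_chunk();
        /* journal_start(chunk credits) would happen here, before any
         * page lock is taken, matching the inverted lock ordering... */
        txns++;
        written += chunk;  /* ...write `chunk` pages, then journal_stop() */
    }
    *written_out = written;
    return txns;
}
```

The point of the structure is that no single transaction ever needs more credits than the journal can grant, at the cost of committing several smaller transactions per writepages call.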
    • mm: Add range_cont mode for writeback · 06d6cf69
      Committed by Aneesh Kumar K.V
      Filesystems like ext4 need to start a new transaction in
      writepages() for block allocation.  This happens with delayed
      allocation, and there is a limit to how many credits we can request
      from the journal layer.  So we call write_cache_pages() multiple
      times, with wbc->nr_to_write set to the maximum possible value
      limited by the available journal credits.
      
      Add a new writeback mode that enables us to handle this behaviour.
      In the new mode we update wbc->range_start to point to the next
      offset to be written, so the next call to write_cache_pages() will
      start write-out from that range_start offset.  In the new mode we
      also limit writing to the specified wbc->range_end.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      06d6cf69
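The range_cont behaviour can be modelled in a few lines. This is a toy model, not the kernel's writeback_control: `toy_wbc` and `write_pass()` are illustrative names, and "writing" a page is just advancing a cursor. Each pass is capped by nr_to_write, and range_start is advanced so the next pass resumes where the previous one stopped.

```c
#include <assert.h>

/* Toy stand-in for struct writeback_control. */
struct toy_wbc {
    long range_start;  /* first page index to write */
    long range_end;    /* last page index (inclusive) */
    long nr_to_write;  /* budget for one pass */
};

/* One write_cache_pages-style pass: writes at most nr_to_write pages
 * within [range_start, range_end] and, in range_cont style, moves
 * range_start past the pages written.  Returns pages written. */
long write_pass(struct toy_wbc *wbc)
{
    long written = 0;
    while (written < wbc->nr_to_write &&
           wbc->range_start <= wbc->range_end) {
        wbc->range_start++;  /* "write" one page, advance the cursor */
        written++;
    }
    return written;
}
```

Calling write_pass() repeatedly walks the whole range in credit-sized slices, which is exactly the calling pattern the commit message describes for ext4_da_writepages().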
  2. Jul 15, 2008 (1 commit)
    • ext4: delayed allocation ENOSPC handling · d2a17637
      Committed by Mingming Cao
      This patch does block reservation for delayed allocation, to avoid
      ENOSPC later at page flush time.
      
      Blocks (data and metadata) are reserved at da_write_begin() time;
      the free-blocks counter is updated then, and the number of reserved
      blocks is stored in a per-inode counter.
      
      At writepage time, the unused reserved metadata blocks are returned.
      At unlink/truncate time, reserved blocks are properly released.
      
      Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      to fix the old allocator's block reservation accounting with
      delalloc, adding a lock to guard the counters and also fixing the
      reservation for metadata blocks.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      d2a17637
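The reservation scheme described above can be sketched as two counter operations. This is a hypothetical userspace model: `toy_sb`, `toy_inode2`, `reserve_blocks()` and `release_unused_meta()` are made-up names, and real ext4 guards these counters with a lock, which this single-threaded sketch omits.

```c
#include <assert.h>

#define TOY_ENOSPC (-28)

struct toy_sb     { long free_blocks; };                  /* fs-wide  */
struct toy_inode2 { long reserved_data, reserved_meta; }; /* per-inode */

/* At write_begin() time: reserve data and metadata blocks up front so
 * the later page flush cannot hit ENOSPC.  Failure is reported now,
 * while the write can still return an error to the caller. */
int reserve_blocks(struct toy_sb *sb, struct toy_inode2 *inode,
                   long data, long meta)
{
    if (sb->free_blocks < data + meta)
        return TOY_ENOSPC;           /* fail early, at write time */
    sb->free_blocks -= data + meta;  /* update the free counter now */
    inode->reserved_data += data;    /* remember per-inode */
    inode->reserved_meta += meta;
    return 0;
}

/* At writepage time: return the metadata blocks that turned out to be
 * unneeded once the real allocation happened. */
void release_unused_meta(struct toy_sb *sb, struct toy_inode2 *inode,
                         long used_meta)
{
    long unused = inode->reserved_meta - used_meta;
    sb->free_blocks += unused;
    inode->reserved_meta = used_meta;
}
```

The key property is that a write which cannot be backed by free blocks fails at reserve_blocks() time rather than silently losing data at flush time.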
  3. Jul 12, 2008 (28 commits)
  4. Jul 14, 2008 (1 commit)
    • jbd2: fix race between jbd2_journal_try_to_free_buffers() and jbd2 commit transaction · 530576bb
      Committed by Mingming Cao
      journal_try_to_free_buffers() can race with the jbd2 commit
      transaction when the latter is holding a buffer reference while
      waiting for the data buffer to flush to disk.  If the caller of
      journal_try_to_free_buffers() tries hard to release the buffers, it
      will treat the failure as an error and return to the caller.  We
      have seen direct I/O fail due to this race.  Some callers of
      releasepage() also expect the buffer to be dropped when passing the
      GFP_KERNEL mask to releasepage()->journal_try_to_free_buffers().
      
      With this patch, if the caller passes GFP_KERNEL to indicate that
      this call may wait, then when try_to_free_buffers() fails we wait
      for journal_commit_transaction() to finish committing the current
      committing transaction, and then try to free those buffers again
      with the journal locked.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      530576bb
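The wait-and-retry shape of the fix can be reduced to a small toy model. This is not the jbd2 code: `toy_journal`, `try_free()` and `wait_for_commit()` are illustrative stand-ins, and `may_wait` plays the role of "the caller passed GFP_KERNEL".

```c
#include <assert.h>
#include <stdbool.h>

struct toy_journal { bool committing; };

/* Freeing fails while a commit is in flight and holds the buffers. */
static bool try_free(struct toy_journal *j) { return !j->committing; }

/* Stand-in for waiting on journal_commit_transaction() to finish. */
static void wait_for_commit(struct toy_journal *j) { j->committing = false; }

/* If the first try fails and the caller may sleep (GFP_KERNEL in the
 * real code), wait out the commit and retry; a caller that may not
 * sleep simply reports failure, as before the patch. */
bool toy_try_to_free_buffers(struct toy_journal *j, bool may_wait)
{
    if (try_free(j))
        return true;
    if (!may_wait)
        return false;     /* e.g. an atomic caller: give up */
    wait_for_commit(j);   /* sleep until the commit finishes... */
    return try_free(j);   /* ...then try again */
}
```

This is why the direct-I/O case stops failing: its releasepage() path passes a mask that allows sleeping, so the race window is ridden out instead of being reported as an error.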
  5. Jul 12, 2008 (2 commits)
    • ext4: New inode allocation for FLEX_BG meta-data groups. · 772cb7c8
      Committed by Jose R. Santos
      This patch mostly controls the way inodes are allocated in order to
      make ialloc aware of flex_bg block group grouping.  It achieves this
      by bypassing the Orlov allocator when block group metadata are
      packed together through mke2fs.  Since the impact on the block
      allocator is minimal, this patch should have little or no effect on
      other block allocation algorithms.  By controlling the inode
      allocation, it can basically control where the initial search for
      new blocks begins and thus indirectly manipulate the block
      allocator.
      
      This allocator favors data and metadata locality, so the disk will
      gradually be filled from block group zero upward.  This helps
      improve performance by reducing seek time.  Since the group of inode
      tables within one flex_bg is treated as one giant inode table,
      uninitialized block groups would not need to partially initialize as
      many inode tables as with Orlov, which helps fsck time as the
      filesystem usage goes up.
      Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
      Signed-off-by: Valerie Clement <valerie.clement@bull.net>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      772cb7c8
    • jbd2: Add commit time into the commit block · 736603ab
      Committed by Theodore Ts'o
      Carlo Wood has demonstrated that it's possible to recover deleted
      files from the journal.  Something that will make this easier is
      putting the time of the commit into the commit block.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      736603ab
  6. Jul 14, 2008 (2 commits)