1. 28 Oct 2009, 2 commits
    • aio: implement request batching · cfb1e33e
      Jeff Moyer authored
      Some workloads issue batches of small I/O, and the performance is
      poor due to the call to blk_run_address_space for every single
      iocb.  Nathan Roberts pointed this out, and suggested that by
      deferring this call until all I/Os in the iocb array are submitted
      to the block layer, we can realize some impressive performance
      gains (up to 30% for sequential 4k reads in batches of 16).  A
      userspace sketch of this batched submission pattern follows the
      entry below.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      cfb1e33e
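
      Below is a minimal userspace sketch of the kind of batch this patch
      speeds up: sixteen sequential 4k O_DIRECT reads handed to the kernel
      in a single io_submit() call.  The libaio calls are standard; the
      file name and sizes are illustrative.

        /* Sketch: submit a batch of 16 sequential 4k O_DIRECT reads with
         * one io_submit() call.  Assumes libaio (link with -laio) and a
         * pre-existing file "test.dat" of at least 64k; error handling
         * is abbreviated. */
        #define _GNU_SOURCE
        #include <libaio.h>
        #include <fcntl.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        #define BATCH 16
        #define BLKSZ 4096

        int main(void)
        {
                io_context_t ctx;
                struct iocb cbs[BATCH], *cbp[BATCH];
                struct io_event events[BATCH];
                void *buf;
                int fd, i;

                fd = open("test.dat", O_RDONLY | O_DIRECT);
                memset(&ctx, 0, sizeof(ctx));
                io_setup(BATCH, &ctx);

                for (i = 0; i < BATCH; i++) {
                        /* O_DIRECT requires block-aligned buffers */
                        posix_memalign(&buf, BLKSZ, BLKSZ);
                        io_prep_pread(&cbs[i], fd, buf, BLKSZ,
                                      (long long)i * BLKSZ);
                        cbp[i] = &cbs[i];
                }

                /* one io_submit() for the whole batch: with the patch
                 * above, the block device is unplugged once here rather
                 * than once per iocb */
                io_submit(ctx, BATCH, cbp);
                io_getevents(ctx, BATCH, BATCH, events, NULL);

                io_destroy(ctx);
                close(fd);
                return 0;
        }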
    • block: get rid of the WRITE_ODIRECT flag · 1af60fbd
      Jeff Moyer authored
      The WRITE_ODIRECT flag is only used in one place, and that code path
      happens to also call blk_run_address_space.  The introduction of this
      flag, then, could result in the device being unplugged twice for every
      I/O.
      
      Further, with the batching changes in the next patch, we don't want an
      O_DIRECT write to imply a queue unplug.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      1af60fbd
  2. 07 Oct 2009, 1 commit
    • block: Seperate read and write statistics of in_flight requests v2 · 316d315b
      Nikanth Karthikesan authored
      Commit a9327cac added separate read and write statistics for
      in-flight requests, and exported the number of read and write
      requests in progress separately through sysfs.
      
      But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
      output from "iostat -kx 2": global values for service time and
      utilization were garbage, and for interval values, utilization was
      always 100% and service time was higher than normal.
      
      So this was reverted by commit 0f78ab98.
      
      The problem was in part_round_stats_single(); I had missed the
      following hunk:
              if (now == part->stamp)
                      return;
      
      -       if (part->in_flight) {
      +       if (part_in_flight(part)) {
                      __part_stat_add(cpu, part, time_in_queue,
                                      part_in_flight(part) * (now - part->stamp));
                      __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
      
      With this chunk included, the reported regression is fixed.  A
      short sketch of what part_in_flight() must compute follows the
      entry below.
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      316d315b
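
      The crux of the missed hunk: once the in-flight count is kept per
      direction, every consumer of the old single counter has to test and
      weight by the sum of both.  A simplified, self-contained sketch of
      that idea (the field layout is illustrative, not the exact kernel
      definition):

        /* Simplified sketch, not the exact kernel structures: with the
         * in-flight count split by direction, the accounting code must
         * use the sum of both, which is what part_in_flight() provides
         * in the fixed hunk above. */
        struct hd_struct_sketch {
                unsigned long in_flight[2];     /* [0] reads, [1] writes */
        };

        static inline unsigned long
        part_in_flight_sketch(struct hd_struct_sketch *part)
        {
                return part->in_flight[0] + part->in_flight[1];
        }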
  3. 05 Oct 2009, 2 commits
    • a99bbaf5
    • Revert "Seperate read and write statistics of in_flight requests" · 0f78ab98
      Jens Axboe authored
      This reverts commit a9327cac.
      
      Corrado Zoccolo <czoccolo@gmail.com> reports:
      
      "with 2.6.32-rc1 I started getting the following strange output from
      "iostat -kx 2":
      Linux 2.6.31bisect (et2) 	04/10/2009 	_i686_	(2 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                10,70    0,00    3,16   15,75    0,00   70,38
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
      sda              18,22     0,00    0,67    0,01    14,77     0,02    43,94     0,01   10,53 39043915,03 2629219,87
      sdb              60,89     9,68   50,79    3,04  1724,43    50,52    65,95     0,70   13,06 488437,47 2629219,87
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 2,72    0,00    0,74    0,00    0,00   96,53
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 6,68    0,00    0,99    0,00    0,00   92,33
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 4,40    0,00    0,73    1,47    0,00   93,40
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00 100,00
      sdb               0,00     4,00    0,00    3,00     0,00    28,00    18,67     0,06   19,50 333,33 100,00
      
      Global values for service time and utilization are garbage. For
      interval values, utilization is always 100%, and service time is
      higher than normal.
      
      I bisected it down to commit a9327cac ("Seperate read and write
      statistics of in_flight requests") and verified that reverting just
      that commit indeed solves the issue on 2.6.32-rc1."
      
      So until this is debugged, revert the bad commit.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      0f78ab98
  4. 03 Oct 2009, 2 commits
  5. 02 Oct 2009, 6 commits
  6. 01 Oct 2009, 3 commits
  7. 30 Sep 2009, 9 commits
  8. 29 Sep 2009, 1 commit
  9. 30 Sep 2009, 1 commit
  10. 29 Sep 2009, 9 commits
    • ext4: Avoid updating the inode table bh twice in no journal mode · 830156c7
      Frank Mayhar authored
      This is a cleanup of commit 91ac6f43.  Since ext4_mark_inode_dirty()
      has already called ext4_mark_iloc_dirty(), which in turn calls
      ext4_do_update_inode(), it's not necessary to have ext4_write_inode()
      call ext4_do_update_inode() in no journal mode.  Indeed, it would be
      duplicated work.
      Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Frank Mayhar <fmayhar@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      830156c7
    • nilfs2: fix missing initialization of i_dir_start_lookup member · 3cc811bf
      Ryusuke Konishi authored
      The i_dir_start_lookup field in nilfs_inode_info objects should be
      cleared when the objects are allocated, but the initialization was
      missing when an inode was read from disk.  This adds the missing
      initialization; a minimal sketch of the fix follows the entry below.
      
      Since the variable just provides a starting page for directory
      lookups, the bug was nonfatal until now.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      3cc811bf
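
      A minimal sketch of the shape of the fix; only the field name comes
      from the commit text, while the structure and function here are
      illustrative:

        /* Illustrative sketch only: clear the directory-lookup hint when
         * the in-memory inode is populated from disk, the path that was
         * missing the initialization. */
        struct nilfs_inode_info_sketch {
                unsigned long i_dir_start_lookup; /* page where lookups resume */
                /* ... other fields elided ... */
        };

        static void nilfs_read_inode_sketch(struct nilfs_inode_info_sketch *ii)
        {
                ii->i_dir_start_lookup = 0;     /* previously left stale */
        }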
    • nilfs2: fix missing zero-fill initialization of btree node cache · 1f28fcd9
      Ryusuke Konishi authored
      This fixes file system corruption that infrequently happens after
      mount.  The problem was reported by users under the title "[NILFS
      users] Fail to mount NILFS." (Message-ID:
      <200908211918.34720.yuri@itinteg.net>), and so forth.  I've also
      experienced the corruption multiple times on kernels 2.6.30 and
      2.6.31.
      
      The problem turned out to be caused by a discrepancy between
      mapping->nrpages of a btree node cache and the actual number of
      pages hung on the cache; if mapping->nrpages becomes zero while the
      cache still has pages, truncate_inode_pages() returns without doing
      anything.  Usually this is harmless, except that it may cause a
      page leak, but garbage collection fairly infrequently sees a stale
      page remaining in the btree node cache of the DAT (i.e. the disk
      address translation file of nilfs), and that induces the corruption.
      
      I identified a missing initialization in the btree node caches as
      the root cause.  This corrects the bug.
      
      I've tested this on kernels 2.6.30 and 2.6.31.
      
      I've tested this for kernel 2.6.30 and 2.6.31.
      Reported-by: Yuri Chislov <yuri@itinteg.net>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: stable <stable@kernel.org>
      1f28fcd9
    • Btrfs: proper -ENOSPC handling · 9ed74f2d
      Josef Bacik authored
      At the start of a transaction we call btrfs_reserve_metadata_space()
      and specify how many items we plan on modifying.  Then, once we've
      done our modifications, we call btrfs_unreserve_metadata_space() for
      the same number of items we reserved (this pattern is sketched after
      the entry below).
      
      To keep track of the metadata needed for data, I've had to add an
      extent_io op for when we merge extents.  This lets us track space
      properly when we are doing sequential writes, so we don't end up
      reserving far more metadata space than we need.
      
      The only place where the metadata space accounting is not done is in
      the relocation code.  This is because Yan is going to be reworking
      that code in the near future, so running btrfs-vol -b could still
      possibly result in an ENOSPC-related panic.  This patch also turns
      off the metadata_ratio stuff in order to allow users to use their
      disk space more efficiently.
      
      This patch makes it so we track how much metadata we need for an
      inode's delayed allocation extents by tracking how many extents are
      currently waiting for allocation.  It introduces two new callbacks
      for the extent_io trees, merge_extent_hook and split_extent_hook.
      These help us keep track of when we merge delalloc extents together
      and split them up.  Reservations are handled before any actual
      dirtying occurs, and we unreserve after we dirty.
      
      btrfs_unreserve_metadata_for_delalloc() will make the appropriate
      unreservations as needed, based on the number of reservations and
      the number of extents we currently have.  Doing the reservation
      outside of the actual dirtying lets us do things like
      filemap_flush() the inode to try and force delalloc to happen, or,
      as a last resort, actually start allocation on all delalloc inodes
      in the fs.  This has survived dbench, fs_mark and an fsx torture
      test.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9ed74f2d
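
      A hedged sketch of the reserve/modify/unreserve pattern described
      above; the two reservation helpers are named in the commit, but the
      argument lists and transaction calls are simplified here, not the
      exact kernel signatures:

        /* Hedged sketch of the reservation pattern; simplified
         * signatures, not the exact kernel API of this patch. */
        static int modify_items_sketch(struct btrfs_root *root, int num_items)
        {
                struct btrfs_trans_handle *trans;
                int ret;

                /* reserve metadata space for everything we plan to
                 * touch, so ENOSPC is reported here rather than in the
                 * middle of the transaction */
                ret = btrfs_reserve_metadata_space(root, num_items);
                if (ret)
                        return ret;

                trans = btrfs_start_transaction(root, 1);
                /* ... modify up to num_items tree items ... */
                btrfs_end_transaction(trans, root);

                /* give back the reservation for the same item count */
                btrfs_unreserve_metadata_space(root, num_items);
                return 0;
        }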
    • ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first · f3ce8064
      Theodore Ts'o authored
      Move the check that the original and donor inodes are different to
      earlier in the function, to avoid a potential deadlock from trying
      to lock the same inode twice.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      f3ce8064
    • ext4: async direct IO for holes and fallocate support · 8d5d02e6
      Mingming Cao authored
      For async direct I/O that covers holes or fallocated space, the
      end_io callback now queues the conversion work on a workqueue but
      does not flush the work right away, as that might take too long.
      
      But when fsync is called after all the data is written, the user
      expects the metadata to be updated as well before fsync returns.
      
      Thus we need to flush the conversion work when fsync() is called.
      This patch keeps track of a list of completed async direct I/Os
      that have work queued on the workqueue.  When fsync() is called, it
      goes through the list and does the conversion (a sketch follows the
      entry below).
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      8d5d02e6
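
      A hedged sketch of the fsync-side flush described above; the list
      walk is the mechanism the commit describes, but the struct, list,
      and helper names here are illustrative, not necessarily the exact
      ext4 symbols added by this patch:

        /* Illustrative sketch: on fsync, walk the per-inode list of
         * completed async direct I/Os whose conversion work is still
         * queued, and convert the extents synchronously. */
        static int flush_completed_aio_sketch(struct inode *inode)
        {
                struct io_end_sketch *io, *tmp;
                int ret = 0, err;

                list_for_each_entry_safe(io, tmp,
                                &completed_io_list_sketch(inode), list) {
                        err = convert_unwritten_extents_sketch(inode,
                                        io->offset, io->size);
                        if (err && !ret)
                                ret = err;
                        list_del(&io->list);
                        kfree(io);
                }
                return ret;     /* fsync returns only after conversion */
        }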
    • ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O · 4c0425ff
      Mingming Cao authored
      Currently the DIO VFS code passes create = 0 when writing to the
      middle of a file.  It does this to avoid block allocation for holes,
      so as not to expose stale data when there is a parallel buffered
      read (which does not hold the i_mutex lock).  Direct I/O writes
      into holes fall back to buffered I/O for this reason.
      
      Since preallocated extents are treated as holes when doing a
      get_block() lookup (the buffer is not mapped), direct I/O over
      fallocate also falls back to buffered I/O.  Thus ext4 silently
      falls back to buffered I/O in the above two cases, which is
      undesirable.
      
      To fix this, this patch creates uninitialized extents when a direct
      I/O write goes into holes in sparse files, and registers an end_io
      callback that converts the uninitialized extent to an initialized
      extent after the I/O is completed (a sketch of the completion side
      follows the entry below).
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      4c0425ff
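
      A hedged sketch of the completion side of that scheme; the callback
      shape follows the generic DIO end_io convention, but the names here
      are illustrative rather than the exact symbols added by this patch:

        /* Illustrative sketch: the write goes to an uninitialized
         * extent, so a parallel buffered read of the range sees a hole
         * (no stale data); only after the data is safely on disk does
         * the completion callback flip the extent to initialized. */
        static void dio_end_io_sketch(struct kiocb *iocb, loff_t offset,
                                      ssize_t size, void *private)
        {
                struct inode *inode = private;

                if (size <= 0)
                        return;   /* nothing written, nothing to convert */
                convert_unwritten_extents_sketch(inode, offset, size);
        }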
    • ext4: Split uninitialized extents for direct I/O · 0031462b
      Mingming Cao authored
      When writing into an uninitialized extent via direct I/O, if the
      I/O doesn't exactly cover the uninitialized extent, split the
      extent into uninitialized and initialized extents before submitting
      the I/O.  This avoids needing to deal with an ENOSPC error in the
      end_io callback that gets used for direct I/O.
      
      When the I/O is complete, the written extent will be marked as
      initialized.
      
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      0031462b
    • ext4: release reserved quota when block reservation for delalloc retry · 9f0ccfd8
      Mingming Cao authored
      ext4_da_reserve_space() can reserve quota blocks multiple times if
      ext4_claim_free_blocks() fails and we retry the allocation.  We
      should release the quota reservation before restarting (a sketch of
      the fixed retry loop follows the entry below).
      Bug found by Jan Kara.
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      9f0ccfd8
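
      A hedged sketch of the fixed retry loop; ext4_claim_free_blocks()
      is named in the commit, while the quota helpers here stand in for
      the era's quota API and are illustrative:

        /* Illustrative sketch of the fix: if claiming filesystem blocks
         * fails and we retry, first release the quota reservation taken
         * in this iteration, so retries do not pile up reservations. */
        static int da_reserve_space_sketch(struct inode *inode, long nblocks)
        {
                int retries = 0;
        repeat:
                if (quota_reserve_blocks_sketch(inode, nblocks))
                        return -EDQUOT;

                if (ext4_claim_free_blocks(EXT4_SB(inode->i_sb), nblocks)) {
                        /* the fix: drop this iteration's reservation */
                        quota_release_reservation_sketch(inode, nblocks);
                        if (should_retry_alloc_sketch(inode->i_sb, &retries))
                                goto repeat;
                        return -ENOSPC;
                }
                return 0;
        }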
  11. 30 Sep 2009, 1 commit
    • ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks · 55138e0b
      Theodore Ts'o authored
      Work around problems in the writeback code to force out writebacks
      in larger chunks than just 4MB, which is too small.  This also works
      around limitations in the ext4 block allocator, which can't allocate
      more than 2048 blocks at a time.  So we need to defeat the
      round-robin characteristics of the writeback code and try to write
      out as many blocks from one inode as possible before allowing the
      writeback code to move on to another inode.  We add a new
      per-filesystem tunable, max_writeback_mb_bump, which caps this to a
      default of 128MB per inode.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      55138e0b
  12. 28 Sep 2009, 2 commits
  13. 27 Sep 2009, 1 commit