1. 10 June 2009 (21 commits)
    • nilfs2: use device's backing_dev_info for btree node caches · a53b4751
      Ryusuke Konishi authored
      Previously, default_backing_dev_info was used for the mapping of btree
      node caches.  This uses device dependent backing_dev_info to allow
      detailed control of the device for the btree node pages.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: return EBUSY against delete request on snapshot · 30c25be7
      Ryusuke Konishi authored
      This helps userland programs like the rmcp command to distinguish
      error codes returned against a checkpoint removal request.
      
      Previously -EPERM was returned, which was not distinguishable from real
      permission errors.  This also allows removal of the latest checkpoint,
      because the deletion leads to the creation of a new checkpoint and is
      thus harmless to the filesystem.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: enable sync_page method · e85dc1d5
      Ryusuke Konishi authored
      This adds a missing sync_page method which unplugs bio requests when
      waiting for page locks. This will improve read performance of nilfs.
      
      Here is a measurement result using dd command.
      
      Without this patch:
      
       # mount -t nilfs2 /dev/sde1 /test
       # dd if=/test/aaa of=/dev/null bs=512k
       1024+0 records in
       1024+0 records out
       536870912 bytes (537 MB) copied, 6.00688 seconds, 89.4 MB/s
      
      With this patch:
      
       # mount -t nilfs2 /dev/sde1 /test
       # dd if=/test/aaa of=/dev/null bs=512k
       1024+0 records in
       1024+0 records out
       536870912 bytes (537 MB) copied, 3.54998 seconds, 151 MB/s
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: set bio unplug flag for the last bio in segment · 30bda0b8
      Ryusuke Konishi authored
      This sets BIO_RW_UNPLUG flag on the last bio of each segment during
      write.  The last bio should be unplugged immediately because the
      caller waits for the completion after the submission.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: allow future expansion of metadata read out via get info ioctl · 003ff182
      Ryusuke Konishi authored
      Nilfs has some ioctl commands to read out metadata from meta data
      files:
      
       - NILFS_IOCTL_GET_CPINFO for checkpoint file,
       - NILFS_IOCTL_GET_SUINFO for segment usage file, and
       - NILFS_IOCTL_GET_VINFO for Disk Address Translation (DAT) file,
         respectively.
      
      Every routine on these metadata files is implemented to allow future
      expansion of the on-disk format.  But the above ioctl commands do not
      support expansion even though the nilfs_argv structure can handle
      arbitrary sizes for data exchanged via ioctl.
      
      This allows future expansion of the following structures which give
      basic format of the "get information" ioctls:
      
       - struct nilfs_cpinfo
       - struct nilfs_suinfo
       - struct nilfs_vinfo
      
      So, this introduces forward compatibility for these ioctl commands.
      
      In this patch, a sanity check in nilfs_ioctl_get_info() function is
      changed to accept larger data structure [1], and metadata read
      routines are rewritten so that they become compatible for larger
      structures; the routines will just ignore the remaining fields which
      the current version of nilfs doesn't know.
      
      [1] The ioctl function already has another upper limit (PAGE_SIZE
          per structure, which appears in the nilfs_ioctl_wrap_copy
          function), so this will not cause a security problem.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • NILFS2: Pagecache usage optimization on NILFS2 · 258ef67e
      Hisashi Hifumi authored
      Hi,
      
      I introduced "is_partially_uptodate" aops for NILFS2.
      
      A page can have multiple buffers, and even if the page is not uptodate,
      some buffers can be uptodate in a pagesize != blocksize environment.
      This aops checks whether all buffers corresponding to the part of a
      file that we want to read are uptodate.  If so, we do not have to
      issue an actual read I/O to the HDD even if the page is not uptodate,
      because the portion we want to read is uptodate.
      The "block_is_partially_uptodate" function is already used by ext2/3/4.
      With this patch, random read/write mixed workloads or random-read-
      after-random-write workloads can be optimized, and we get a
      performance improvement.
      
      I did a performance test using sysbench.
      
      1 --file-block-size=8K --file-total-size=2G --file-test-mode=rndrw --file-fsync-freq=0 --file-rw-ratio=1 run
      
      -2.6.30-rc5
      
      Test execution summary:
          total time:                          151.2907s
          total number of events:              200000
          total time taken by event execution: 2409.8387
          per-request statistics:
               min:                            0.0000s
               avg:                            0.0120s
               max:                            0.9306s
               approx.  95 percentile:         0.0439s
      
      Threads fairness:
          events (avg/stddev):           12500.0000/238.52
          execution time (avg/stddev):   150.6149/0.01
      
      -2.6.30-rc5-patched
      
      Test execution summary:
          total time:                          140.8828s
          total number of events:              200000
          total time taken by event execution: 2240.8577
          per-request statistics:
               min:                            0.0000s
               avg:                            0.0112s
               max:                            0.8750s
               approx.  95 percentile:         0.0418s
      
      Threads fairness:
          events (avg/stddev):           12500.0000/218.43
          execution time (avg/stddev):   140.0536/0.01
      
      arch: ia64
      pagesize: 16k
      
      Thanks.
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: remove nilfs_btree_operations from btree mapping · 7cde31d7
      Ryusuke Konishi authored
      This will remove indirect function calls using the
      nilfs_btree_operations table.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: remove nilfs_direct_operations from direct mapping · 355c6b61
      Ryusuke Konishi authored
      This will remove indirect function calls using the
      nilfs_direct_operations table.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: remove bmap pointer operations · d4b96157
      Ryusuke Konishi authored
      Previously, the bmap code of nilfs used three types of function
      tables.  The overuse of indirect function calls decreased source
      readability and incurred many indirect jumps that would confuse the
      branch prediction of processors.
      
      This eliminates one type of the function tables,
      nilfs_bmap_ptr_operations, which was used to dispatch low level
      pointer operations of the nilfs bmap.
      
      This adds a new integer variable "b_ptr_type" to nilfs_bmap struct,
      and uses the value to select the pointer operations.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: remove useless b_low and b_high fields from nilfs_bmap struct · 3033342a
      Ryusuke Konishi authored
      This will cut off 16 bytes from the nilfs_bmap struct which is
      embedded in the on-memory inode of nilfs.
      
      The b_high field was never used, and the b_low field stores a constant
      value which can be determined by whether the inode uses btree for
      block mapping or not.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: remove pointless NULL check of bpop_commit_alloc_ptr function · e473c1f2
      Ryusuke Konishi authored
      This indirect function is set to NULL only for gc cache inodes, but
      the gc cache inodes never call this function.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: move get block functions in bmap.c into btree codes · f198dbb9
      Ryusuke Konishi authored
      Two get-block functions for btree nodes, nilfs_bmap_get_block() and
      nilfs_bmap_get_new_block(), are called only from the btree code.
      This relocation will increase opportunities for compiler optimization.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: remove nilfs_bmap_delete_block · 9f098900
      Ryusuke Konishi authored
      nilfs_bmap_delete_block() is a wrapper function calling
      nilfs_btnode_delete().  This removes it for simplicity.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: remove nilfs_bmap_put_block · 087d01b4
      Ryusuke Konishi authored
      nilfs_bmap_put_block() is a wrapper function calling brelse().  This
      eliminates the wrapper for simplicity.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: remove header file for segment list operations · 654137dd
      Ryusuke Konishi authored
      This will eliminate obsolete list operations of the nilfs_segment_entry
      structure, which has been used to handle multiple segment numbers.
      
      The patch ("nilfs2: remove list of freeing segments") removed use of
      the structure from the segment constructor code, and this patch
      simplifies the remaining code by integrating it into recovery.c.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: eliminate removal list of segments · 071cb4b8
      Ryusuke Konishi authored
      This will clean up the removal list of segments and the related
      functions from segment.c and ioctl.c, which have hurt code
      readability.
      
      This elimination is applied by using nilfs_sufile_updatev() previously
      introduced in the patch ("nilfs2: add sufile function that can modify
      multiple segment usages").
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: add sufile function that can modify multiple segment usages · dda54f4b
      Ryusuke Konishi authored
      This is a preparation for the later cleanup patch ("nilfs2: remove
      list of freeing segments").
      
      This adds nilfs_sufile_updatev() to sufile, which can modify multiple
      segment usages at a time.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: unify bmap operations starting use of indirect block address · d97a51a7
      Ryusuke Konishi authored
      This simplifies some low level functions of bmap.
      
      Three bmap pointer operations, nilfs_bmap_start_v(),
      nilfs_bmap_commit_v(), and nilfs_bmap_abort_v(), are unified into a
      single nilfs_bmap_start_v() function, and the related indirect
      function calls are replaced with it.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: remove nilfs_dat_prepare_free function · 65822070
      Ryusuke Konishi authored
      This function is unused.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • jbd: fix race in buffer processing in commit code · a61d90d7
      Jan Kara authored
      In the commit code, we scan buffers attached to a transaction.  During
      this scan, we sometimes have to drop j_list_lock and then recheck
      whether the journal buffer head didn't get freed by
      journal_try_to_free_buffers().  But checking for buffer_jbd(bh) isn't
      enough, because a new journal head could get attached to our buffer
      head.  So add a check whether the journal head remained the same and
      whether it's still on the same transaction and list.
      
      This is a nasty bug and can cause problems like memory corruption (use after
      free) or trigger various assertions in JBD code (observed).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: <stable@kernel.org>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • autofs4: remove hashed check in validate_wait() · 463aea1a
      Ian Kent authored
      The recent ->lookup() deadlock correction required the directory inode
      mutex to be dropped while waiting for expire completion.  We were
      concerned about side effects from this change and one has been identified.
      
      I saw several error messages.  They cause autofs to become quite
      confused and don't really point to the actual problem.
      
      Things like:
      
      handle_packet_missing_direct:1376: can't find map entry for (43,1827932)
      
      which is usually totally fatal (although in this case it wouldn't be,
      except that I treat it as such because it normally is).
      
      do_mount_direct: direct trigger not valid or already mounted
      /test/nested/g3c/s1/ss1
      
      which is recoverable, however if this problem is at play it can cause
      autofs to become quite confused as to the dependencies in the mount tree
      because mount triggers end up mounted multiple times.  It's hard to
      accurately check for this over mounting case and automount shouldn't need
      to if the kernel module is doing its job.
      
      There was one other message, similar in consequence of this last one but I
      can't locate a log example just now.
      
      When checking if a mount has already completed prior to adding a new mount
      request to the wait queue we check if the dentry is hashed and, if so, if
      it is a mount point.  But, if a mount successfully completed while we
      slept on the wait queue mutex the dentry must exist for the mount to have
      completed so the test is not really needed.
      
      Mounts can also be done on top of a global root dentry, so for the above
      case, where a mount request completes and the wait queue entry has already
      been removed, the hashed test returning false can cause an incorrect
      callback to the daemon.  Also, d_mountpoint() is not sufficient to check
      if a mount has completed for the multi-mount case when we don't have a
      real mount at the base of the tree.
      Signed-off-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 June 2009 (1 commit)
  3. 06 June 2009 (2 commits)
    • ext3/4 with synchronous writes gets wedged by Postfix · 72a43d63
      Al Viro authored
      OK, that's probably the easiest way to do it, as much as I don't like it...
      Since iget() et al. will not accept I_FREEING (they will wait for it to
      go away and restart), and since we'd better have serialization between
      new/free on fs data structures anyway, we can afford to simply skip
      I_FREEING et al. in insert_inode_locked().
      
      We do that from new_inode, so it won't race with free_inode in any
      interesting ways, and it won't race with iget (of any origin; nfsd, or
      in case of fs corruption, a lookup) since both will still wait for
      I_LOCK.
      Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
      Acked-by: Jan Kara <jack@suse.cz>
      Tested-by: David Watson <dbwatson@ukfsn.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Fix nobh_truncate_page() to not pass stack garbage to get_block() · 460bcf57
      Theodore Ts'o authored
      The nobh_truncate_page() function is used by ext2, exofs, and jfs.  Of
      these three, only ext2's and jfs's get_block() functions pay attention
      to bh->b_size --- which is normally the filesystem blocksize except
      when the get_block() function is called by either mpage_readpage(),
      mpage_readpages(), or the direct I/O routines in fs/direct-io.c.
      
      Unfortunately, nobh_truncate_page() does not initialize map_bh before
      calling the filesystem-supplied get_block() function.  So ext2 and jfs
      will try to calculate the number of blocks to map by taking stack
      garbage and shifting it left by inode->i_blkbits.  This should be
      *mostly* harmless (except that the filesystem will do some unneeded
      work) unless the stack garbage is less than the filesystem's
      blocksize, in which case maxblocks will be zero, the attempt to find
      out whether or not the filesystem has a hole at a given logical block
      will fail, and the page cache entry might not get zeroed out.
      
      Also, if the stack garbage in map_bh->state happens to have the
      BH_Mapped bit set, there could be an attempt to call readpage() on a
      non-existent page, which could cause nobh_truncate_page() to return an
      error when it should not.
      
      Fix this by initializing map_bh->state and map_bh->size.
      
      Fortunately, it's probably fairly unlikely that ext2 and jfs users
      mount with nobh these days.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 05 June 2009 (1 commit)
    • Btrfs: Fix oops and use after free during space balancing · 44fb5511
      Chris Mason authored
      The btrfs allocator uses list_for_each to walk the available block
      groups when searching for free blocks.  It starts off with a hint
      to help find the best block group for a given allocation.
      
      The hint is resolved into a block group, but we don't properly check
      to make sure the block group we find isn't in the middle of being
      freed due to filesystem shrinking or balancing.  If it is being
      freed, the list pointers in it are bogus and can't be trusted.  But,
      the code happily goes along and uses them in the list_for_each loop,
      leading to all kinds of fun.
      
      The fix used here is to check to make sure the block group we find really
      is on the list before we use it.  list_del_init is used when removing
      it from the list, so we can do a proper check.
      
      The allocation clustering code has a similar bug where it will trust
      the block group in the current free space cluster.  If our allocation
      flags have changed (going from single spindle dup to raid1 for example)
      because the drives in the FS have changed, we're not allowed to use
      the old block group any more.
      
      The fix used here is to check the current cluster against the
      current allocation flags.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  5. 04 June 2009 (1 commit)
  6. 02 June 2009 (3 commits)
  7. 30 May 2009 (1 commit)
  8. 29 May 2009 (4 commits)
  9. 28 May 2009 (3 commits)
  10. 27 May 2009 (2 commits)
  11. 24 May 2009 (1 commit)
    • [CIFS] Avoid open on possible directories since Samba now rejects them · 8db14ca1
      Steve French authored
      Small change (mostly formatting) to limit lookup based open calls to
      file create only.
      
      After discussion yesterday on samba-technical about the posix lookup
      regression, and looking at a problem with cifs posix open to one
      particular Samba version, Jeff and JRA realized that Samba server's
      behavior changed in this area (posix open behavior on files vs.
      directories).  To make this behavior consistent, JRA just made a
      fix to Samba server to alter how it handles open of directories (now
      returning the equivalent of EISDIR instead of success). Since we don't
      know at lookup time whether the inode is a directory or file (and
      thus whether posix open will succeed with most current Samba server),
      this change avoids the posix open code on lookup open (just issues
      posix open on creates).    This gets the semantic benefits we want
      (atomicity, posix byte range locks, improved write semantics on newly
      created files) and file create still is fast, and we avoid the problem
      that Jeff noticed yesterday with "openat" (and some open directory
      calls) of non-cached directories to one version of Samba server, and
      will work with future Samba versions (which include the fix jra just
      pushed into Samba server).  I confirmed this approach with jra
      yesterday and with Shirish today.
      
      Posix open is only called (at lookup time) for file create now.
      For opens (rather than creates), because we do not know if it
      is a file or directory yet, and current Samba no longer allows
      us to do posix open on dirs, we could end up wasting an open call
      on what turns out to be a dir. For file opens, we wait to call posix
      open till cifs_open.  It could be added here (lookup) in the future
      but the performance tradeoff of the extra network request when EISDIR
      or EACCES is returned would have to be weighed against the 50%
      reduction in network traffic in the other paths.
      Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
      Tested-by: Jeff Layton <jlayton@redhat.com>
      CC: Jeremy Allison <jra@samba.org>
      Signed-off-by: Steve French <sfrench@us.ibm.com>