1. 11 Sep, 2010 1 commit
    • G
      Track negative entries v3 · 5e98d492
      Goldwyn Rodrigues committed
      Track negative dentries by recording the generation number of the parent
      directory in d_fsdata. The generation number for the parent directory is
      recorded in the inode_info, which increments every time the lock on the
      directory is dropped.
      
      If the generation numbers of the parent directory and the negative dentry
      match, there is no need to perform the revalidate; otherwise a revalidate
      is forced. This improves performance in situations where nodes look for
      the same non-existent file multiple times.
      
      Thanks Mark for explaining the DLM sequence.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
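The generation check described in this commit can be sketched in plain C. This is only an illustration under stated assumptions: the struct and function names (`sketch_dir_info`, `sketch_neg_dentry`, and so on) are hypothetical stand-ins, not the ocfs2 code, and the `recorded_gen` field merely plays the role of the value the commit stores in d_fsdata.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parent directory's inode info. */
struct sketch_dir_info {
    unsigned long lock_gen;      /* bumped each time the directory lock is dropped */
};

/* Hypothetical stand-in for a negative dentry. */
struct sketch_neg_dentry {
    unsigned long recorded_gen;  /* plays the role of the value kept in d_fsdata */
};

/* Called when the lock on the directory is dropped. */
static void sketch_dir_lock_dropped(struct sketch_dir_info *dir)
{
    assert(dir != NULL);
    dir->lock_gen++;
}

/* Remember the parent's generation when a lookup finds nothing. */
static void sketch_neg_dentry_record(struct sketch_neg_dentry *d,
                                     const struct sketch_dir_info *dir)
{
    assert(d != NULL && dir != NULL);
    d->recorded_gen = dir->lock_gen;
}

/* True when the negative dentry can be trusted without a revalidate. */
static bool sketch_neg_dentry_still_valid(const struct sketch_neg_dentry *d,
                                          const struct sketch_dir_info *dir)
{
    assert(d != NULL && dir != NULL);
    return d->recorded_gen == dir->lock_gen;
}
```

Repeated lookups of the same missing name stay cheap as long as the lock was never dropped in between; a single drop invalidates every negative dentry recorded under the old generation.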
  2. 10 Sep, 2010 13 commits
    • T
      ocfs2: Cache system inodes of other slots. · b4d693fc
      Tao Ma committed
      During orphan scan, if we are slot 0, and we are replaying
      orphan_dir:0001, the general process is that for every file
      in this dir:
      1. we will iget orphan_dir:0001, since there is no inode for it.
         we will have to create an inode and read it from the disk.
      2. do the normal work, such as delete_inode and remove it from
         the dir if it is allowed.
      3. call iput orphan_dir:0001 when we are done. In this case,
         since we have no dcache for this inode, i_count will
         reach 0, and VFS will have to call clear_inode and in
         ocfs2_clear_inode we will checkpoint the inode which will let
         ocfs2_cmt and journald begin to work.
      4. We loop back to 1 for the next file.
      
      So for every deleted file, we actually have to read the
      orphan dir from the disk and checkpoint the journal. This is very
      time consuming and causes a lot of journal checkpoint I/O.
      A better solution is to hold another reference to these
      inodes in ocfs2_super. Then, if there is no other race among
      nodes (which would make dlmglue checkpoint the inode), in step 3
      clear_inode won't be called, and in step 1 we only need to
      read the inode the first time. This is a big win for us.
      
      So this patch will try to cache system inodes of other slots so
      that we will have one more reference for these inodes and avoid
      the extra inode read and journal checkpoint.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • J
      libfs: Fix shift bug in generic_check_addressable() · a33f13ef
      Joel Becker committed
      generic_check_addressable() erroneously shifts pages down by a block
      factor when it should be shifting up.  To prevent overflow, we shift
      blocks down to pages.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
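A userspace sketch of the addressability check may make the shift direction easier to see. It is an approximation under assumptions, not the kernel function: the page shift is fixed at 12, the widths of sector_t and pgoff_t are passed in as the hypothetical parameters `sector_bits` and `pgoff_bits`, and failures return -1 rather than kernel error codes.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u   /* assumption: 4 KiB pages */

/*
 * Returns 0 if a volume of num_blocks blocks, each (1 << blocksize_bits)
 * bytes, can be addressed; -1 otherwise. sector_bits and pgoff_bits model
 * the widths of the sector and page-offset types.
 */
static int sketch_check_addressable(unsigned blocksize_bits, uint64_t num_blocks,
                                    unsigned sector_bits, unsigned pgoff_bits)
{
    assert(sector_bits >= 1 && sector_bits <= 64);
    assert(pgoff_bits >= 1 && pgoff_bits <= 64);

    if (num_blocks == 0)
        return 0;
    if (blocksize_bits < 9 || blocksize_bits > SKETCH_PAGE_SHIFT)
        return -1;

    uint64_t last_block = num_blocks - 1;
    /* Shift blocks DOWN to pages; shifting a page count up could overflow. */
    uint64_t last_page = last_block >> (SKETCH_PAGE_SHIFT - blocksize_bits);

    uint64_t max_sector = sector_bits == 64 ? UINT64_MAX
                                            : (UINT64_C(1) << sector_bits) - 1;
    uint64_t max_page = pgoff_bits == 64 ? UINT64_MAX
                                         : (UINT64_C(1) << pgoff_bits) - 1;

    /* Largest block index whose 512-byte sector number still fits. */
    if (last_block > (max_sector >> (blocksize_bits - 9)))
        return -1;
    if (last_page > max_page)
        return -1;
    return 0;
}
```

With 4 KiB blocks and 32-bit sector numbers, the limit works out to 2^29 blocks, i.e. 2 TiB, which matches 2^32 sectors of 512 bytes.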
    • P
      OCFS2: Allow huge (> 16 TiB) volumes to mount · 3bdb8efd
      Patrick J. LoPresti committed
      The OCFS2 developers have already done all of the hard work to allow
      volumes larger than 16 TiB.  But there is still a "sanity check" in
      fs/ocfs2/super.c that prevents the mounting of such volumes, even when
      the cluster size and journal options would allow it.
      
      This patch replaces that sanity check with a more sophisticated one to
      mount a huge volume provided that (a) it is addressable by the raw
      word/address size of the system (borrowing a test from ext4); (b) the
      volume is using JBD2; and (c) the JBD2_FEATURE_INCOMPAT_64BIT flag is
      set on the journal.
      
      I factored out the sanity check into its own function.  I also moved it
      from ocfs2_initialize_super() down to ocfs2_check_volume(); any earlier,
      and the journal will not have been initialized yet.
      
      This patch is one of a pair, and it depends on the other ("JBD2: Allow
      feature checks before journal recovery").
      
      I have tested this patch on small volumes, huge volumes, and huge
      volumes without 64-bit block support in the journal.  All of them appear
      to work or to fail gracefully, as appropriate.
      Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
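The three conditions (a) to (c) listed in this commit message can be read as a small decision function. The sketch below is hypothetical: `struct sketch_volume` and its fields are invented for illustration, and "huge" is modeled as a block count that no longer fits in 32 bits, which is an assumption rather than something the message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of what the mount path knows about a volume. */
struct sketch_volume {
    uint64_t total_blocks;
    bool addressable;     /* (a) passes the raw word/address size test */
    bool uses_jbd2;       /* (b) journal is JBD2 */
    bool journal_64bit;   /* (c) journal has the 64-bit incompat feature set */
};

/* Returns 0 if the volume may be mounted, -1 if it must be refused. */
static int sketch_check_volume(const struct sketch_volume *v)
{
    assert(v != NULL);

    if (!v->addressable)
        return -1;
    /* Assumption: volumes whose block count fits in 32 bits need nothing more. */
    if (v->total_blocks <= UINT32_MAX)
        return 0;
    if (!v->uses_jbd2)
        return -1;
    if (!v->journal_64bit)
        return -1;
    return 0;
}
```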
    • P
      JBD2: Allow feature checks before journal recovery · 1113e1b5
      Patrick J. LoPresti committed
      Before we start accessing a huge (> 16 TiB) OCFS2 volume, we need to
      confirm that its journal supports 64-bit offsets.  In particular, we
      need to check the journal's feature bits before recovering the journal.
      
      This is not possible with JBD2 at present, because the journal
      superblock (where the feature bits reside) is not loaded from disk until
      the journal is recovered.
      
      This patch loads the journal superblock in
      jbd2_journal_check_used_features() if it has not already been loaded,
      allowing us to check the feature bits before journal recovery.
      Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
      Cc: linux-ext4@vger.kernel.org
      Acked-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
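The load-on-demand behaviour this commit describes follows a common pattern: read the superblock the first time a feature check needs it. The sketch below is a simplified model, not the JBD2 code. `struct sketch_journal`, `sketch_load_superblock()` and the `SKETCH_INCOMPAT_64BIT` value are hypothetical, and the "disk" is just a field in the struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INCOMPAT_64BIT 0x2u   /* placeholder bit for this sketch */

struct sketch_journal {
    bool sb_loaded;          /* has the superblock been read yet? */
    uint32_t incompat;       /* in-memory copy of the feature bits */
    uint32_t disk_incompat;  /* stands in for the on-disk superblock */
};

/* Pretend to read the journal superblock from disk. */
static int sketch_load_superblock(struct sketch_journal *j)
{
    assert(j != NULL);
    j->incompat = j->disk_incompat;
    j->sb_loaded = true;
    return 0;
}

/* Feature check that loads the superblock first if nobody has done so. */
static bool sketch_check_used_features(struct sketch_journal *j, uint32_t incompat)
{
    assert(j != NULL);
    if (incompat == 0)
        return true;
    if (!j->sb_loaded && sketch_load_superblock(j) != 0)
        return false;
    return (j->incompat & incompat) == incompat;
}
```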
    • P
      ext3/ext4: Factor out disk addressability check · 30ca22c7
      Patrick J. LoPresti committed
      As part of adding support for OCFS2 to mount huge volumes, we need to
      check that the sector_t and page cache of the system are capable of
      addressing the entire volume.
      
      An identical check already appears in ext3 and ext4.  This patch moves
      the addressability check into its own function in fs/libfs.c and
      modifies ext3 and ext4 to invoke it.
      
      [Edited to -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel]
      Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
      Cc: linux-ext4@vger.kernel.org
      Acked-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • T
      ocfs2: Remove obsolete comments before ocfs2_start_trans. · 17ae5211
      Tao Ma committed
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • T
      ocfs2: Remove unused old_id in ocfs2_commit_cache. · f9c57ada
      Tao Ma committed
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • J
      ocfs2: Remove ocfs2_sync_inode() · 4c38881f
      Jan Kara committed
      ocfs2_sync_inode() is used only from ocfs2_sync_file(). But all data has
      already been written before ocfs2_sync_file() is called, and ocfs2 doesn't use
      the inode's private_list for tracking metadata buffers, so sync_mapping_buffers()
      is superfluous as well.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • G
      Reorganize data elements to reduce struct sizes · 83fd9c7f
      Goldwyn Rodrigues committed
      Thanks for the comments. I have incorporated them all.
      
      CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
      Statistics now look like -
      ocfs2_write_ctxt: 2144 - 2136 = 8
      ocfs2_inode_info: 1960 - 1848 = 112
      ocfs2_journal: 168 - 160 = 8
      ocfs2_lock_res: 336 - 304 = 32
      ocfs2_refcount_tree: 512 - 472 = 40
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • T
      ocfs2: Remove obscure error handling in direct_write. · 95fa859a
      Tao Ma committed
      In ocfs2, we don't actually allow any direct write past i_size;
      see the function ocfs2_prepare_inode_for_write. So we don't
      need the bogus simple_setsize.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • T
      ocfs2: Add some trace log for orphan scan. · 3c3f20c9
      Tao Ma committed
      The orphan scan worker currently has no trace log, so it is
      very hard to tell whether it has finished or is blocked.
      Add two mlog trace messages so that we can tell whether
      the current orphan scan worker is blocked or not.
      They helped when I analyzed an orphan scan bug.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • T
      Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8. · ddee5cdb
      Tristan Ye committed
      This ioctl gives non-privileged end users a way to gather filesystem
      information.
      
      OCFS2_IOC_INFO is the new ioctl. Userspace passes the kernel a
      structure containing an array of request pointers and a request
      count, for example:
      
      * From userspace:
      
      struct ocfs2_info_blocksize oib = {
              .ib_req = {
                      .ir_magic = OCFS2_INFO_MAGIC,
                      .ir_code = OCFS2_INFO_BLOCKSIZE,
                      ...
              }
              ...
      };
      
      struct ocfs2_info_clustersize oic = {
              ...
      };
      
      uint64_t reqs[2] = {(unsigned long)&oib,
                          (unsigned long)&oic};
      
      struct ocfs2_info info = {
              .oi_requests = reqs,
              .oi_count = 2,
      };
      
      ret = ioctl(fd, OCFS2_IOC_INFO, &info);
      
      * In kernel:
      
      Get the request pointers from *info*, then handle each request one by one.
      
      The idea is to keep each separate request small enough to guarantee
      better backward and forward compatibility, since a small request is
      less likely to break if the on-disk filesystem format changes.
      
      Currently, the following 7 requests are supported, per the requirements of
      the userspace tool o2info, and I believe the list will grow over time :-)
      
              OCFS2_INFO_CLUSTERSIZE
              OCFS2_INFO_BLOCKSIZE
              OCFS2_INFO_MAXSLOTS
              OCFS2_INFO_LABEL
              OCFS2_INFO_UUID
              OCFS2_INFO_FS_FEATURES
              OCFS2_INFO_JOURNAL_SIZE
      
      This ioctl is only specific to OCFS2.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
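The kernel-side step from this commit message, "get the request pointers from info, then handle each request one by one", can be sketched as a dispatch loop. Everything here is a hypothetical userspace model: the `sketch_` structs, the magic value, the `SKETCH_FL_FILLED` flag and the made-up answers are assumptions for illustration, and a real kernel handler would copy each request from and to user memory instead of dereferencing the pointers directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INFO_MAGIC 0x4F32494EU   /* placeholder value for this sketch */
#define SKETCH_FL_FILLED  0x1U          /* hypothetical "request answered" flag */

enum sketch_code { SKETCH_INFO_CLUSTERSIZE = 1, SKETCH_INFO_BLOCKSIZE = 2 };

/* Common header at the start of every request. */
struct sketch_req { uint32_t magic; uint32_t code; uint32_t flags; };

struct sketch_blocksize   { struct sketch_req req; uint32_t blocksize; };
struct sketch_clustersize { struct sketch_req req; uint32_t clustersize; };

/* Top-level argument: an array of request pointers plus a count. */
struct sketch_info { uint64_t requests; uint32_t count; };

/* Handle a single request; unknown codes are left unfilled, not rejected. */
static int sketch_handle_one(struct sketch_req *req)
{
    assert(req != NULL);
    if (req->magic != SKETCH_INFO_MAGIC)
        return -1;

    switch (req->code) {
    case SKETCH_INFO_BLOCKSIZE:
        ((struct sketch_blocksize *)req)->blocksize = 4096;        /* made-up answer */
        break;
    case SKETCH_INFO_CLUSTERSIZE:
        ((struct sketch_clustersize *)req)->clustersize = 1048576; /* made-up answer */
        break;
    default:
        return 0;
    }
    req->flags |= SKETCH_FL_FILLED;
    return 0;
}

/* Walk the pointer array and handle each small request independently. */
static int sketch_handle_info(const struct sketch_info *info)
{
    assert(info != NULL);
    const uint64_t *reqs = (const uint64_t *)(uintptr_t)info->requests;

    for (uint32_t i = 0; i < info->count; i++) {
        int ret = sketch_handle_one((struct sketch_req *)(uintptr_t)reqs[i]);
        if (ret != 0)
            return ret;
    }
    return 0;
}
```

Leaving unknown request codes unfilled instead of failing is one way to get the forward compatibility the message asks for: an older handler can skip requests it does not know, and the caller can tell from the flag which answers are valid.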
    • S
      mm: Move vma_stack_continue into mm.h · 39aa3cb3
      Stefan Bader committed
      So it can be used by all that need to check for that.
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 08 Sep, 2010 14 commits
  4. 07 Sep, 2010 2 commits
    • M
      fuse: fix lock annotations · b9ca67b2
      Miklos Szeredi committed
      Sparse doesn't understand lock annotations of the form
      __releases(&foo->lock).  Change them to __releases(foo->lock).  Same
      for __acquires().
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • M
      fuse: flush background queue on connection close · 595afaf9
      Miklos Szeredi committed
      David Bartley reported that fuse can hang in fuse_get_req_nofail() when
      the connection to the filesystem server is no longer active.
      
      If bg_queue is not empty then flush_bg_queue() called from
      request_end() can put more requests on to the pending queue.  If this
      happens while ending requests on the processing queue then those
      background requests will be queued to the pending list and never
      ended.
      
      Another problem is that fuse_dev_release() didn't wake up processes
      sleeping on blocked_waitq.
      
      Solve this by:
      
       a) flushing the background queue before calling end_requests() on the
          pending and processing queues
      
       b) setting blocked = 0 and waking up processes waiting on
          blocked_waitq()
      
      Thanks to David for an excellent bug report.
      Reported-by: David Bartley <andareed@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      CC: stable@kernel.org
  5. 04 Sep, 2010 1 commit
  6. 03 Sep, 2010 3 commits
    • T
      xfs: Make fiemap work with sparse files · 9af25465
      Tao Ma committed
      In xfs_vn_fiemap, we set bmv_count to fi_extent_max + 1 and want
      to return fi_extent_max extents, but this does not work for
      a sparse file. The reason is that xfs_getbmap also records
      holes in 'out', while 'out' is allocated with
      bmv_count (fi_extent_max + 1) entries, which does not account for
      holes. So in the worst case, if the 'out' vector looks like
      [hole, extent, hole, extent, hole, ... hole, extent, hole],
      we will only return half of fi_extent_max extents.
      
      This patch adds a new flag, BMV_IF_NO_HOLES, for bmv_iflags.
      With this flag set, xfs_getbmap does not use a slot in 'out' for
      a hole. The solution is a bit ugly: it simply does not advance the
      index of the 'out' vector. I felt that it is not easy to skip holes
      at the very beginning, since we have complicated checks and
      functions such as xfs_getbmapx_fix_eof_hole that adjust 'out'.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
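The effect of not advancing the output index for holes can be shown with a toy model. The code below is a sketch under assumptions, not xfs_getbmap: `struct sketch_seg` and `sketch_getbmap()` are hypothetical, and the mapping is a plain array of alternating holes and extents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mapping record: either a hole or a real extent. */
struct sketch_seg {
    bool is_hole;
    unsigned long start;
    unsigned long len;
};

/*
 * Copy records from map[] into out[], which has room for cap entries.
 * With no_holes set, a hole does not consume a slot in out[].
 * Returns the number of real extents stored.
 */
static size_t sketch_getbmap(const struct sketch_seg *map, size_t nmap,
                             struct sketch_seg *out, size_t cap, bool no_holes)
{
    assert(map != NULL && out != NULL);
    size_t used = 0, extents = 0;

    for (size_t i = 0; i < nmap && used < cap; i++) {
        if (map[i].is_hole && no_holes)
            continue;               /* do not advance the index of out[] */
        out[used++] = map[i];
        if (!map[i].is_hole)
            extents++;
    }
    return extents;
}
```

For a file laid out as hole, extent, hole, extent, and so on, a four-slot output vector yields only two extents without the flag and four with it, which is the "half of fi_extent_max" problem in miniature.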
    • D
      xfs: prevent 32bit overflow in space reservation · 72656c46
      Dave Chinner committed
      If we attempt to preallocate more than 2^32 blocks of space in a
      single syscall, the transaction block reservation will overflow
      leading to a hang in the superblock block accounting code. This
      is trivially reproduced with xfs_io. Fix the problem by capping the
      allocation reservation to the maximum number of blocks a single
      xfs_bmapi() call can allocate (2^21 blocks).
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
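A short sketch of why the cap helps: a block count above 2^32 silently wraps when stored in a 32-bit reservation, while a per-call cap keeps every reservation in range. The names below are hypothetical and the 2^21 limit is taken from the commit message as written; this is not the XFS allocation loop.

```c
#include <assert.h>
#include <stdint.h>

/* Per-call allocation cap quoted in the commit message. */
#define SKETCH_MAX_ALLOC_BLOCKS (UINT64_C(1) << 21)

/* 32-bit reservation for one call, capped so it cannot wrap around. */
static uint32_t sketch_reservation(uint64_t remaining_blocks)
{
    uint64_t want = remaining_blocks < SKETCH_MAX_ALLOC_BLOCKS
                        ? remaining_blocks
                        : SKETCH_MAX_ALLOC_BLOCKS;
    return (uint32_t)want;
}

/* Number of capped calls needed to cover a whole preallocation request. */
static uint64_t sketch_iterations(uint64_t total_blocks)
{
    uint64_t n = 0;

    while (total_blocks > 0) {
        uint32_t resv = sketch_reservation(total_blocks);
        assert(resv > 0);
        total_blocks -= resv;
        n++;
    }
    return n;
}
```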
    • J
      nfsd4: mask out non-access bits in nfs4_access_to_omode · 8f34a430
      J. Bruce Fields committed
      This fixes an unnecessary BUG().
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  7. 02 Sep, 2010 2 commits
    • A
      xfs: Disallow 32bit project quota id · 23963e54
      Arkadiusz Miśkiewicz committed
      Currently the on-disk structure can hold only a 16-bit project quota
      id, so disallow 32-bit ones. This fixes a problem where the in-kernel
      structures hold the project quota id in 32-bit variables while the
      on-disk structures use 16-bit variables, which causes project quota member
      files to be inaccessible for some operations (like mv/rm).
      Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
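The mismatch between a 32-bit in-memory id and a 16-bit on-disk field can be illustrated in a few lines. This is a hypothetical sketch, not the XFS code: both function names are invented, and returning -EINVAL is an assumption about how such a rejection would typically be reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* What a 16-bit on-disk field keeps of a 32-bit project id. */
static uint16_t sketch_on_disk_projid(uint32_t projid)
{
    return (uint16_t)projid;
}

/* Reject ids that would not survive the round trip to disk. */
static int sketch_check_projid(uint32_t projid)
{
    if (projid > UINT16_MAX)
        return -EINVAL;
    assert(sketch_on_disk_projid(projid) == projid);
    return 0;
}
```

An id such as 65536 would be stored as 0 on disk, so the in-memory and on-disk views of the same file would disagree; refusing the id up front avoids that state.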
    • D
      xfs: improve buffer cache hash scalability · 9bc08a45
      Dave Chinner committed
      When doing large parallel file creates on a 16p machine, a large amount of
      time is spent in _xfs_buf_find(). A system-wide profile with perf top
      shows this:
      shows this:
      
                1134740.00 19.3% _xfs_buf_find
                 733142.00 12.5% __ticket_spin_lock
      
      The problem is that the hash contains 45,000 buffers, and the hash table width
      is only 256 buffers. That means we've got around 200 buffers per chain, and
      searching it is quite expensive. The hash table size needs to increase.
      
      Secondly, every time we do a lookup, we promote the buffer we find to the head
      of the hash chain. This is causing cachelines to be dirtied and causes
      invalidation of cachelines across all CPUs that may have walked the hash chain
      recently. Hence every walk of the hash chain is effectively a cold cache walk.
      Remove the promotion to avoid this invalidation.
      
      The results are:
      
                1045043.00 21.2% __ticket_spin_lock
                 326184.00  6.6% _xfs_buf_find
      
      A 70% drop in the CPU usage when looking up buffers. Unfortunately that does
      not result in an increase in performance under this workload as contention on
      the inode_lock soaks up most of the reduction in CPU usage.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
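The figures quoted in this commit message can be checked with two small helpers. The helper names are hypothetical; the inputs are the numbers given in the text (45,000 buffers, 256 hash buckets, and the _xfs_buf_find sample counts before and after the change).

```c
#include <assert.h>

/* Average number of buffers per hash chain. */
static double sketch_chain_length(double buffers, double buckets)
{
    assert(buckets > 0.0);
    return buffers / buckets;
}

/* Relative drop between two sample counts, in percent. */
static double sketch_percent_drop(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

Plugging in the quoted values gives an average chain of roughly 176 buffers, in line with the "around 200" estimate, and a drop of about 71% in _xfs_buf_find samples (from 1134740 to 326184), in line with the stated 70%.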
  8. 30 Aug, 2010 2 commits
  9. 28 Aug, 2010 2 commits
    • E
      fsnotify: drop two useless bools in the fsnotify main loop · 92b4678e
      Eric Paris committed
      The fsnotify main loop has 2 bools which indicate whether we processed the
      inode or vfsmount mark in that particular pass through the loop.  These
      bools can be replaced with the inode_group and vfsmount_group variables,
      which actually makes the code a little easier to understand.
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • E
      fsnotify: fix list walk order · f72adfd5
      Eric Paris committed
      Marks were stored on the inode and vfsmount mark lists in order from
      highest memory address to lowest memory address.  The code to walk those
      lists assumed they were in order from lowest to highest, with
      unpredictable results when trying to match up marks from each.  It was
      possible that extra events would be sent to userspace when inode
      marks ignoring events didn't get matched with the vfsmount marks.
      
      This problem only affected fanotify when using both vfsmount and inode
      marks simultaneously.
      Signed-off-by: Eric Paris <eparis@redhat.com>
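Why the walk order matters can be seen in a small merge-style walk over two sorted lists. This is a hypothetical model rather than the fsnotify loop: the marks are reduced to plain integers standing in for group addresses, and `sketch_count_matches()` only counts the groups it manages to pair up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk two sorted lists in lockstep and count the groups present in both.
 * 'descending' states the order the walker ASSUMES the lists are in.
 */
static size_t sketch_count_matches(const uintptr_t *inode_marks, size_t ni,
                                   const uintptr_t *mnt_marks, size_t nm,
                                   bool descending)
{
    assert(inode_marks != NULL && mnt_marks != NULL);
    size_t i = 0, m = 0, matches = 0;

    while (i < ni && m < nm) {
        if (inode_marks[i] == mnt_marks[m]) {
            matches++;          /* same group on both lists: handle together */
            i++;
            m++;
        } else {
            /* Advance whichever list the assumed order says comes first. */
            bool inode_first = descending ? inode_marks[i] > mnt_marks[m]
                                          : inode_marks[i] < mnt_marks[m];
            if (inode_first)
                i++;
            else
                m++;
        }
    }
    return matches;
}
```

With the lists stored highest to lowest, a walker that assumes the opposite order steps past entries it should have paired, so an inode mark that ignores an event never meets its vfsmount counterpart.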