1. 08 12月, 2011 1 次提交
    • S
      ceph: use i_ceph_lock instead of i_lock · be655596
      Sage Weil 提交于
      We have been using i_lock to protect all kinds of data structures in the
      ceph_inode_info struct, including lists of inodes that we need to iterate
      over while avoiding races with inode destruction.  That requires grabbing
      a reference to the inode with the list lock protected, but igrab() now
      takes i_lock to check the inode flags.
      
      Changing the list lock ordering would be a painful process.
      
      However, using a ceph-specific i_ceph_lock in the ceph inode instead of
      i_lock is a simple mechanical change and avoids the ordering constraints
      imposed by igrab().
      Reported-by: NAmon Ott <a.ott@m-privacy.de>
      Signed-off-by: NSage Weil <sage@newdream.net>
      be655596
  2. 06 11月, 2011 2 次提交
  3. 02 11月, 2011 1 次提交
  4. 26 10月, 2011 1 次提交
    • S
      Revert "ceph: don't truncate dirty pages in invalidate work thread" · 83eaea22
      Sage Weil 提交于
      This reverts commit c9af9fb6.
      
      We need to block and truncate all pages in order to reliably invalidate
      them.  Otherwise, we could:
      
       - have some uptodate pages in the cache
       - queue an invalidate
       - write(2) locks some pages
       - invalidate_work skips them
       - write(2) only overwrites part of the page
       - page now dirty and uptodate
       -> partial leakage of invalidated data
      
      It's not entirely clear why we started skipping locked pages in the first
      place.  I just ran this through fsx and didn't see any problems.
      Signed-off-by: NSage Weil <sage@newdream.net>
      83eaea22
  5. 27 7月, 2011 4 次提交
  6. 20 7月, 2011 3 次提交
  7. 08 6月, 2011 1 次提交
  8. 12 5月, 2011 1 次提交
  9. 05 5月, 2011 1 次提交
  10. 22 3月, 2011 1 次提交
  11. 16 3月, 2011 1 次提交
    • S
      ceph: preserve I_COMPLETE across rename · 09adc80c
      Sage Weil 提交于
      d_move puts the renamed dentry at the end of d_subdirs, screwing with our
      cached dentry directory offsets.  We were just clearing I_COMPLETE to avoid
      any possibility of trouble.  However, assigning the renamed dentry an
      offset at the end of the directory (to match it's new d_subdirs position)
      is sufficient to maintain correct behavior and hold onto I_COMPLETE.
      
      This is especially important for workloads like rsync, which renames files
      into place.  Before, we would lose I_COMPLETE and do MDS lookups for each
      file.  With this patch we only talk to the MDS on create and rename.
      Signed-off-by: NSage Weil <sage@newdream.net>
      09adc80c
  12. 04 3月, 2011 1 次提交
  13. 14 1月, 2011 1 次提交
  14. 13 1月, 2011 2 次提交
    • S
      ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS · 14303d20
      Sage Weil 提交于
      This implements the DIRLAYOUTHASH protocol feature, which passes the dir
      layout over the wire from the MDS.  This gives the client knowledge
      of the correct hash function to use for mapping dentries among dir
      fragments.
      
      Note that if this feature is _not_ present on the client but is on the
      MDS, the client may misdirect requests.  This will result in a forward
      and degrade performance.  It may also result in inaccurate NFS filehandle
      generation, which will prevent fh resolution when the inode is not present
      in the client cache and the parent directories have been fragmented.
      Signed-off-by: NSage Weil <sage@newdream.net>
      14303d20
    • S
      ceph: add dir_layout to inode · 6c0f3af7
      Sage Weil 提交于
      Add a ceph_dir_layout to the inode, and calculate dentry hash values based
      on the parent directory's specified dir_hash function.  This is needed
      because the old default Linux dcache hash function is extremely week and
      leads to a poor distribution of files among dir fragments.
      Signed-off-by: NSage Weil <sage@newdream.net>
      6c0f3af7
  15. 07 1月, 2011 5 次提交
    • N
      fs: provide rcu-walk aware permission i_ops · b74c79e9
      Nick Piggin 提交于
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b74c79e9
    • N
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin 提交于
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fa0d7e3d
    • N
      fs: dcache remove dcache_lock · b5c84bf6
      Nick Piggin 提交于
      dcache_lock no longer protects anything. remove it.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b5c84bf6
    • N
      fs: dcache scale subdirs · 2fd6b7f5
      Nick Piggin 提交于
      Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
      using dcache_lock for these anyway (eg. using i_mutex).
      
      Note: if we change the locking rule in future so that ->d_child protection is
      provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
      But it would be an exception to an otherwise regular locking scheme, so we'd
      have to see some good results. Probably not worthwhile.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      2fd6b7f5
    • N
      fs: dcache scale dentry refcount · b7ab39f6
      Nick Piggin 提交于
      Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
      0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
      we start protecting many other dentry members with d_lock.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b7ab39f6
  16. 18 11月, 2010 1 次提交
  17. 10 11月, 2010 1 次提交
    • S
      ceph: make page alignment explicit in osd interface · b7495fc2
      Sage Weil 提交于
      We used to infer alignment of IOs within a page based on the file offset,
      which assumed they matched.  This broke with direct IO that was not aligned
      to pages (e.g., 512-byte aligned IO).  We were also trusting the alignment
      specified in the OSD reply, which could have been adjusted by the server.
      
      Explicitly specify the page alignment when setting up OSD IO requests.
      Signed-off-by: NSage Weil <sage@newdream.net>
      b7495fc2
  18. 09 11月, 2010 2 次提交
    • S
      ceph: fix update of ctime from MDS · d8672d64
      Sage Weil 提交于
      The client can have a newer ctime than the MDS due to AUTH_EXCL and
      XATTR_EXCL caps as well; update the check in ceph_fill_file_time
      appropriately.
      
      This fixes cases where ctime/mtime goes backward under the right sequence
      of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
      goes to the MDS).
      Signed-off-by: NSage Weil <sage@newdream.net>
      d8672d64
    • S
      ceph: fix version check on racing inode updates · 8bd59e01
      Sage Weil 提交于
      We may get updates on the same inode from multiple MDSs; generally we only
      pay attention if the update is newer than what we already have.  The
      exception is when an MDS sense unstable information, in which case we
      always update.
      
      The old > check got this wrong when our version was odd (e.g. 3) and the
      reply version was even (e.g. 2): the older stale (v2) info would be
      applied.  Fixed and clarified the comment.
      Signed-off-by: NSage Weil <sage@newdream.net>
      8bd59e01
  19. 08 11月, 2010 3 次提交
    • S
      ceph: fix rdcache_gen usage and invalidate · cd045cb4
      Sage Weil 提交于
      We used to use rdcache_gen to indicate whether we "might" have cached
      pages.  Now we just look at the mapping to determine that.  However, some
      old behavior remains from that transition.
      
      First, rdcache_gen == 0 no longer means we have no pages.  That can happen
      at any time (presumably when we carry FILE_CACHE).  We should not reset it
      to zero, and we should not check that it is zero.
      
      That means that the only purpose for rdcache_revoking is to resolve races
      between new issues of FILE_CACHE and an async invalidate.  If they are
      equal, we should invalidate.  On success, we decrement rdcache_revoking,
      so that it is no longer equal to rdcache_gen.  Similarly, if we success
      in doing a sync invalidate, set revoking = gen - 1.  (This is a small
      optimization to avoid doing unnecessary invalidate work and does not
      affect correctness.)
      Signed-off-by: NSage Weil <sage@newdream.net>
      cd045cb4
    • S
      ceph: only let auth caps update max_size · 912a9b03
      Sage Weil 提交于
      Only the auth MDS has a meaningful max_size value for us, so only update it
      in fill_inode if we're being issued an auth cap.  Otherwise, a random
      stat result from a non-auth MDS can clobber a meaningful max_size, get
      the client<->mds cap state out of sync, and make writes hang.
      
      Specifically, even if the client re-requests a larger max_size (which it
      will), the MDS won't respond because as far as it knows we already have a
      sufficiently large value.
      Signed-off-by: NSage Weil <sage@newdream.net>
      912a9b03
    • S
      ceph: fix bad pointer dereference in ceph_fill_trace · d8b16b3d
      Sage Weil 提交于
      We dereference *in a few lines down, but only set it on rename.  It is
      apparently pretty rare for this to trigger, but I have been hitting it
      with a clustered MDSs.
      Signed-off-by: NSage Weil <sage@newdream.net>
      d8b16b3d
  20. 21 10月, 2010 1 次提交
    • Y
      ceph: factor out libceph from Ceph file system · 3d14c5d2
      Yehuda Sadeh 提交于
      This factors out protocol and low-level storage parts of ceph into a
      separate libceph module living in net/ceph and include/linux/ceph.  This
      is mostly a matter of moving files around.  However, a few key pieces
      of the interface change as well:
      
       - ceph_client becomes ceph_fs_client and ceph_client, where the latter
         captures the mon and osd clients, and the fs_client gets the mds client
         and file system specific pieces.
       - Mount option parsing and debugfs setup is correspondingly broken into
         two pieces.
       - The mon client gets a generic handler callback for otherwise unknown
         messages (mds map, in this case).
       - The basic supported/required feature bits can be expanded (and are by
         ceph_fs_client).
      
      No functional change, aside from some subtle error handling cases that got
      cleaned up in the refactoring process.
      Signed-off-by: NSage Weil <sage@newdream.net>
      3d14c5d2
  21. 14 9月, 2010 1 次提交
  22. 26 8月, 2010 1 次提交
  23. 23 8月, 2010 1 次提交
  24. 02 8月, 2010 1 次提交
  25. 28 7月, 2010 1 次提交
  26. 24 7月, 2010 1 次提交