1. 29 April 2014 (1 commit)
    • ceph: clear directory's completeness when creating file · 0a8a70f9
      Committed by Yan, Zheng
      When creating a file, ceph_set_dentry_offset() puts the new dentry
      at the end of the directory's d_subdirs, then sets the dentry's
      offset based on the directory's max offset. The offset does not
      reflect the real position of the dentry in the directory. A later
      readdir reply from the MDS may change the dentry's position/offset.
      This inconsistency can cause missing or duplicate entries in the
      readdir result if the readdir is partly satisfied by
      dcache_readdir().
      
      The fix is to clear the directory's completeness flag after
      creating or renaming a file. This prevents later readdirs from
      using dcache_readdir().
      
      Fixes: http://tracker.ceph.com/issues/8025
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
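
      A minimal sketch of the idea, assuming the completeness flag is a
      bit in i_ceph_flags named CEPH_I_COMPLETE (the helper and flag
      names here are assumptions, not the exact patch):

          /* sketch: after a create/rename changes the directory, drop the
           * "directory is complete in dcache" flag so a later readdir
           * must ask the MDS instead of trusting dcache_readdir() */
          static void clear_dir_complete(struct inode *dir)
          {
                  struct ceph_inode_info *ci = ceph_inode(dir);

                  spin_lock(&ci->i_ceph_lock);
                  ci->i_ceph_flags &= ~CEPH_I_COMPLETE;  /* flag name assumed */
                  spin_unlock(&ci->i_ceph_lock);
          }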
  2. 05 April 2014 (1 commit)
    • ceph: use fl->fl_file as owner identifier of flock and posix lock · eb13e832
      Committed by Yan, Zheng
      flock and posix locks should use fl->fl_file instead of the
      process ID as the owner identifier. (Strictly, a posix lock uses
      fl->fl_owner; fl->fl_owner is usually equal to fl->fl_file, but it
      can also be a customized value.) The process ID of whoever holds
      the lock is only needed for the F_GETLK fcntl(2).
      
      The fix renames the 'pid' fields of struct ceph_mds_request_args
      and struct ceph_filelock to 'owner' and renames the
      'pid_namespace' fields to 'pid'. It assigns fl->fl_file to the
      'owner' field of lock messages and also sets the most significant
      bit of the 'owner' field, which the MDS can use to distinguish
      between old and new clients.
      
      The MDS counterpart of this patch modifies the flock code to not
      take the 'pid_namespace' into consideration when checking for
      conflicting locks.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
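
      A rough illustration of the owner encoding described above (the
      bit position and helper name are assumptions about the wire
      format, for illustration only):

          /* sketch: use the struct file pointer as the lock owner, and
           * set the top bit so the MDS can tell new-style owners apart
           * from old-style process IDs */
          #define NEW_OWNER_FLAG  (1ULL << 63)    /* hypothetical name */

          static u64 lock_owner(struct file_lock *fl)
          {
                  return (u64)(unsigned long)fl->fl_file | NEW_OWNER_FLAG;
          }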
  3. 03 April 2014 (1 commit)
    • ceph: fix ceph_dir_llseek() · f0494206
      Committed by Yan, Zheng
      Comparing the offset with inode->i_sb->s_maxbytes doesn't make
      sense for a directory: for a fragmented directory, the offset
      (frag_t, off) can be larger than inode->i_sb->s_maxbytes.
      
      At the very beginning of ceph_dir_llseek(), the local variable
      old_offset is initialized to the offset parameter. This doesn't
      make sense either; old_offset should be
      ceph_make_fpos(fi->frag, fi->next_offset).
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
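
      For context, a directory position packs the fragment into the
      high 32 bits and the offset within the fragment into the low 32
      bits, roughly:

          /* a directory offset is (frag_t, off) packed into one loff_t,
           * so a large fragment value pushes the position past
           * s_maxbytes even for a small directory */
          static loff_t ceph_make_fpos(unsigned frag, unsigned off)
          {
                  return ((loff_t)frag << 32) | (loff_t)off;
          }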
  4. 18 February 2014 (1 commit)
  5. 31 January 2014 (1 commit)
  6. 30 January 2014 (1 commit)
  7. 29 January 2014 (1 commit)
    • ceph: Fix up after semantic merge conflict · 4db658ea
      Committed by Linus Torvalds
      The previous ceph-client merge resulted in ceph not even building,
      because there was a merge conflict that wasn't visible as an actual data
      conflict: commit 7221fe4c ("ceph: add acl for cephfs") added support
      for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change
      a lot of the POSIX ACL helper functions to be much more helpful to
      filesystems (see for example commits 2aeccbe9 "fs: add generic
      xattr_acl handlers", 5bf3258f "fs: make posix_acl_chmod more useful"
      and 37bc1539 "fs: make posix_acl_create more useful").
      
      The reason this conflict wasn't obvious was many-fold: because it was a
      semantic conflict rather than a data conflict, it wasn't visible in the
      git merge as a conflict.  And because the VFS tree hadn't been in
      linux-next, people hadn't become aware of it that way.  And because I
      was at jury duty this morning, I was using my laptop and as a result not
      doing constant "allmodconfig" builds.
      
      Anyway, this fixes the build and generally removes a fair chunk of the
      Ceph POSIX ACL support code, since the improved helpers seem to match
      really well for Ceph too.  But I don't actually have any way to *test*
      the end result, and I was really hoping for some ACK's for this.  Oh,
      well.
      
      Not compiling certainly doesn't make things easier to test, so I'm
      committing this without the acks after having waited for four hours...
      Plus it's what I would have done for the merge had I noticed the
      semantic conflict..
      Reported-by: Dave Jones <davej@redhat.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Guangliang Zhao <lucienchao@gmail.com>
      Cc: Li Wang <li.wang@ubuntykylin.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 21 January 2014 (3 commits)
    • ceph: add imported caps when handling cap export message · 11df2dfb
      Committed by Yan, Zheng
      A version 3 cap export message includes information about the
      imported caps. It allows us to add the imported caps if the
      corresponding cap import message still hasn't been received.
      
      This lets us handle the situation where the importer MDS crashes
      and the cap import message goes missing.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
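
      A hedged sketch of the version gate (the decoded field layout is
      an assumption; only the version check reflects the description
      above):

          /* sketch: only v3+ export messages carry importer-side cap info */
          if (le16_to_cpu(msg->hdr.version) >= 3) {
                  /* decode the peer cap fields and add the imported cap
                   * now, in case the cap import message never arrives */
          }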
    • ceph: check inode caps in ceph_d_revalidate · 9215aeea
      Committed by Yan, Zheng
      Some inodes in a readdir reply may have no caps. A getattr MDS
      request for these inodes can return -ESTALE. The fix is to
      consider a dentry that links to an inode with no caps as invalid.
      An invalid dentry causes a lookup request to be sent to the MDS,
      and the MDS will send caps back.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
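
      The VFS makes this straightforward: returning 0 from
      ->d_revalidate marks the dentry invalid and triggers a fresh
      lookup. A sketch of the check, assuming ceph_caps_issued() as the
      caps query:

          /* sketch: a positive dentry whose inode holds no caps is
           * treated as invalid; the resulting lookup makes the MDS
           * issue caps again */
          if (dentry->d_inode &&
              ceph_caps_issued(ceph_inode(dentry->d_inode)) == 0)
                  return 0;       /* invalid: force a lookup */
          return 1;               /* still valid */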
    • ceph: fix cache revoke race · 9563f88c
      Committed by Yan, Zheng
      Handle the following sequence of events:
      
      - non-auth MDS revokes Fc cap. queue invalidate work
      - auth MDS issues Fc cap through request reply. i_rdcache_gen gets
        increased.
      - invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
        so it does nothing.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
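
      The race revolves around the generation check in the invalidate
      worker, sketched below; the fix adjusts how i_rdcache_revoking is
      kept in sync when caps are re-issued (simplified; the field names
      are the real ones from ceph_inode_info):

          /* invalidate work: only safe to drop the page cache if no new
           * Fc grant bumped the generation since the revoke was queued */
          spin_lock(&ci->i_ceph_lock);
          if (ci->i_rdcache_revoking != ci->i_rdcache_gen) {
                  /* generations diverged: a reply re-issued Fc, so the
                   * queued invalidation is stale and must be skipped */
                  spin_unlock(&ci->i_ceph_lock);
                  return;
          }
          spin_unlock(&ci->i_ceph_lock);
          invalidate_mapping_pages(inode->i_mapping, 0, -1);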
  9. 01 January 2014 (1 commit)
  10. 14 December 2013 (1 commit)
  11. 24 November 2013 (1 commit)
  12. 07 September 2013 (2 commits)
  13. 16 August 2013 (1 commit)
    • ceph: introduce i_truncate_mutex · b0d7c223
      Committed by Yan, Zheng
      I encountered the deadlock below when running fsstress:
      
      vmtruncate work      truncate                 MDS
      ---------------  ------------------  --------------------------
                         lock i_mutex
                                            <- truncate file
      lock i_mutex (blocked)
                                            <- revoking Fcb (filelock to MIX)
                         send request ->
                                               handle request (xlock filelock)
      
      Initially, there are some dirty pages in the page cache. When the
      kclient receives the truncate message, it reduces the inode size
      and creates some 'out of i_size' dirty pages. The vmtruncate work
      can't truncate these dirty pages because it's blocked by i_mutex.
      Later, when the kclient receives the cap message that revokes the
      Fcb caps, it can't flush all the dirty pages because writepages()
      only flushes dirty pages within the inode size.
      
      When the MDS handles the 'truncate' request from the kclient, it
      waits for the filelock to become stable. But the filelock is
      stuck in an unstable state because it can't finish revoking the
      kclient's Fcb caps.
      
      The truncate pagecache locking has already caused lots of trouble
      for us. I think it's time to simplify it by introducing a new
      mutex. We use the new mutex to prevent concurrent
      truncate_inode_pages(). There is no need to worry about a race
      between buffered write and truncate_inode_pages(), because our
      "get caps" mechanism prevents them from running concurrently.
      Reviewed-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
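
      A minimal sketch of the new serialization, assuming the mutex
      lives in ceph_inode_info next to the existing truncate fields:

          /* sketch: a dedicated mutex serializes page-cache truncation
           * instead of piggybacking on i_mutex */
          struct ceph_inode_info *ci = ceph_inode(inode);

          mutex_lock(&ci->i_truncate_mutex);
          truncate_inode_pages(inode->i_mapping, ci->i_truncate_size);
          mutex_unlock(&ci->i_truncate_mutex);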
  14. 10 August 2013 (1 commit)
    • ceph: fix freeing inode vs removing session caps race · 6f60f889
      Committed by Yan, Zheng
      remove_session_caps() uses iterate_session_caps() to remove caps,
      but iterate_session_caps() skips inodes that are being deleted.
      So session->s_nr_caps can be non-zero after iterate_session_caps()
      returns.
      
      We can fix the issue by waiting until the deletions are complete.
      __wait_on_freeing_inode() is designed for the job, but it is not
      exported, so we use an inode lookup function to reach it.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
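
      Roughly, the wait comes for free from an inode lookup, since
      ilookup-style lookups block on I_FREEING inodes via
      __wait_on_freeing_inode() internally. A sketch, where
      pick_remaining_vino() is a hypothetical helper that reads the
      vino of a leftover cap under the session lock:

          /* sketch: any cap left behind belongs to an inode that is
           * mid-deletion; looking that inode up again blocks until the
           * deletion is done */
          while (session->s_nr_caps > 0) {
                  struct ceph_vino vino = pick_remaining_vino(session);
                  struct inode *inode = ceph_find_inode(sb, vino);

                  iput(inode);    /* returning at all means freeing finished */
          }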
  15. 04 July 2013 (2 commits)
  16. 18 May 2013 (1 commit)
    • ceph: ceph_pagelist_append might sleep while atomic · 39be95e9
      Committed by Jim Schutt
      Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc()
      while holding a lock, but it's spoiled because ceph_pagelist_addpage()
      always calls kmap(), which might sleep.  Here's the result:
      
      [13439.295457] ceph: mds0 reconnect start
      [13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58
      [13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
          . . .
      [13439.376225] Call Trace:
      [13439.378757]  [<ffffffff81076f4c>] __might_sleep+0xfc/0x110
      [13439.384353]  [<ffffffffa03f4ce0>] ceph_pagelist_append+0x120/0x1b0 [libceph]
      [13439.391491]  [<ffffffffa0448fe9>] ceph_encode_locks+0x89/0x190 [ceph]
      [13439.398035]  [<ffffffff814ee849>] ? _raw_spin_lock+0x49/0x50
      [13439.403775]  [<ffffffff811cadf5>] ? lock_flocks+0x15/0x20
      [13439.409277]  [<ffffffffa045e2af>] encode_caps_cb+0x41f/0x4a0 [ceph]
      [13439.415622]  [<ffffffff81196748>] ? igrab+0x28/0x70
      [13439.420610]  [<ffffffffa045e9f8>] ? iterate_session_caps+0xe8/0x250 [ceph]
      [13439.427584]  [<ffffffffa045ea25>] iterate_session_caps+0x115/0x250 [ceph]
      [13439.434499]  [<ffffffffa045de90>] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
      [13439.441646]  [<ffffffffa0462888>] send_mds_reconnect+0x238/0x450 [ceph]
      [13439.448363]  [<ffffffffa0464542>] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
      [13439.455250]  [<ffffffffa0462e42>] check_new_map+0x352/0x500 [ceph]
      [13439.461534]  [<ffffffffa04631ad>] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
      [13439.468432]  [<ffffffff814ebc7e>] ? mutex_unlock+0xe/0x10
      [13439.473934]  [<ffffffffa043c612>] extra_mon_dispatch+0x22/0x30 [ceph]
      [13439.480464]  [<ffffffffa03f6c2c>] dispatch+0xbc/0x110 [libceph]
      [13439.486492]  [<ffffffffa03eec3d>] process_message+0x1ad/0x1d0 [libceph]
      [13439.493190]  [<ffffffffa03f1498>] ? read_partial_message+0x3e8/0x520 [libceph]
          . . .
      [13439.587132] ceph: mds0 reconnect success
      [13490.720032] ceph: mds0 caps stale
      [13501.235257] ceph: mds0 recovery completed
      [13501.300419] ceph: mds0 caps renewed
      
      Fix it up by encoding locks into a buffer first, and when the number
      of encoded locks is stable, copy that into a ceph_pagelist.
      
      [elder@inktank.com: abbreviated the stack info a bit.]
      
      Cc: stable@vger.kernel.org # 3.4+
      Signed-off-by: Jim Schutt <jaschut@sandia.gov>
      Reviewed-by: Alex Elder <elder@inktank.com>
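
      The shape of the fix, heavily abridged (encode_locks_into() is a
      hypothetical stand-in for the new memcpy-only encoding helper;
      variable declarations are elided):

          /* sketch: do the sleeping allocation up front, encode the
           * locks into the flat buffer while holding the lock (memcpy
           * only, no sleeping), then append after unlocking */
          buf = kmalloc(num_locks * sizeof(struct ceph_filelock), GFP_NOFS);
          if (!buf)
                  return -ENOMEM;

          lock_flocks();                  /* no sleeping past this point */
          encode_locks_into(inode, buf);  /* hypothetical: memcpy only */
          unlock_flocks();

          /* kmap inside ceph_pagelist_append may sleep: fine out here */
          err = ceph_pagelist_append(pagelist, buf, len);
          kfree(buf);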
  17. 02 May 2013 (4 commits)
  18. 23 February 2013 (1 commit)
  19. 20 February 2013 (1 commit)
  20. 12 February 2013 (1 commit)
    • ceph: Translate between uid and gids in cap messages and kuids and kgids · 05cb11c1
      Committed by Eric W. Biederman
      - Make the uid and gid arguments of send_cap_msg(), used to compose
        ceph_mds_caps messages, of type kuid_t and kgid_t.
      
      - Pass inode->i_uid and inode->i_gid in __send_cap to send_cap_msg()
        through variables of type kuid_t and kgid_t.
      
      - Modify struct ceph_cap_snap to store uids and gids in types kuid_t
        and kgid_t.  This allows capturing inode->i_uid and inode->i_gid in
        ceph_queue_cap_snap() without loss and passing them to
        __ceph_flush_snaps(), where they are removed from struct
        ceph_cap_snap and passed to send_cap_msg().
      
      - In handle_cap_grant, translate the uid and gid in the initial user
        namespace stored in struct ceph_mds_caps into a kuid and kgid
        before setting inode->i_uid and inode->i_gid.
      
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
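
      The translation pattern, roughly (init_user_ns is the initial
      user namespace; the surrounding code is elided):

          /* compose: kuid_t/kgid_t -> on-wire 32-bit ids */
          fc->uid = cpu_to_le32(from_kuid(&init_user_ns, uid));
          fc->gid = cpu_to_le32(from_kgid(&init_user_ns, gid));

          /* handle_cap_grant: on-wire 32-bit ids -> kuid_t/kgid_t */
          inode->i_uid = make_kuid(&init_user_ns, le32_to_cpu(grant->uid));
          inode->i_gid = make_kgid(&init_user_ns, le32_to_cpu(grant->gid));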
  21. 03 August 2012 (1 commit)
    • ceph: simplify+fix atomic_open · 5ef50c3b
      Committed by Sage Weil
      The initial ->atomic_open op was carried over from the old intent code,
      which was incomplete and didn't really work.  Replace it with a fresh
      method.  In particular:
      
       * always attempt to do an atomic open+lookup, both for the create case
         and for lookups of existing files.
       * fix symlink handling by returning 1 to the VFS so that we can follow
         the link to its destination. This fixes a longstanding ceph bug (#2392).
      Signed-off-by: Sage Weil <sage@inktank.com>
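
      On the symlink point: an ->atomic_open implementation tells the
      VFS "finish this as an ordinary lookup" by calling
      finish_no_open(), which returns 1. A sketch, assuming the dentry
      is already instantiated:

          /* sketch: we can't open a symlink directly, so hand the dentry
           * back and let the VFS follow the link to its destination */
          if (dentry->d_inode && S_ISLNK(dentry->d_inode->i_mode))
                  return finish_no_open(file, dentry);    /* returns 1 */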
  22. 31 July 2012 (1 commit)
    • ceph: define snap counts as u32 everywhere · aa711ee3
      Committed by Alex Elder
      There are two structures in which a count of snapshots is
      maintained:
      
          struct ceph_snap_context {
      	...
              u32 num_snaps;
      	...
          }
      and
          struct ceph_snap_realm {
      	...
              u32 num_prior_parent_snaps;   /*  had prior to parent_since */
      	...
              u32 num_snaps;
      	...
          }
      
      These fields never take on negative values (e.g., to hold special
      meaning), and so are really inherently unsigned.  Furthermore they
      take their value from over-the-wire or on-disk formatted 32-bit
      values.
      
      So change their definition to have type u32, and change some spots
      elsewhere in the code to account for this change.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
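
      Since the counts arrive as little-endian 32-bit values, the
      decode side lines up directly, e.g.:

          /* counts are __le32 on the wire; decode straight into u32 */
          realm->num_snaps = ceph_decode_32(&p);
          realm->num_prior_parent_snaps = ceph_decode_32(&p);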
  23. 14 July 2012 (5 commits)
  24. 22 March 2012 (2 commits)
  25. 13 January 2012 (1 commit)
  26. 04 January 2012 (1 commit)
  27. 08 December 2011 (1 commit)
    • ceph: use i_ceph_lock instead of i_lock · be655596
      Committed by Sage Weil
      We have been using i_lock to protect all kinds of data structures in the
      ceph_inode_info struct, including lists of inodes that we need to iterate
      over while avoiding races with inode destruction.  That requires grabbing
      a reference to the inode with the list lock protected, but igrab() now
      takes i_lock to check the inode flags.
      
      Changing the list lock ordering would be a painful process.
      
      However, using a ceph-specific i_ceph_lock in the ceph inode instead of
      i_lock is a simple mechanical change and avoids the ordering constraints
      imposed by igrab().
      Reported-by: Amon Ott <a.ott@m-privacy.de>
      Signed-off-by: Sage Weil <sage@newdream.net>
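
      The resulting pattern, sketched: ceph's list walks hold
      i_ceph_lock, and igrab() is free to take i_lock inside it because
      nothing ever takes i_ceph_lock while holding i_lock:

          /* sketch: grab an inode reference from a ceph-private list
           * without any lock-ordering worry */
          spin_lock(&ci->i_ceph_lock);
          inode = igrab(&ci->vfs_inode);  /* takes i_lock internally */
          spin_unlock(&ci->i_ceph_lock);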
  28. 06 November 2011 (1 commit)