1. 26 May, 2016 5 commits
    • libceph: drop msg argument from ceph_osdc_callback_t · 85e084fe
      Ilya Dryomov committed
      finish_read(), its only user, uses it to get to hdr.data_len, which is
      what ->r_result is set to on success.  This gains us the ability to
      safely call callbacks from contexts other than reply, e.g. map check.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: switch to calc_target(), part 2 · bb873b53
      Ilya Dryomov committed
      The crux of this is getting rid of ceph_osdc_build_request(), so that
      MOSDOp can be encoded not before but after calc_target() calculates the
      actual target.  Encoding now happens within ceph_osdc_start_request().
      
      Also nuked is the accompanying bunch of pointers into the encoded
      buffer that was used to update fields on each send - instead, the
      entire front is re-encoded.  If we want to support target->name_len !=
      base->name_len in the future, there is no other way, because oid is
      surrounded by other fields in the encoded buffer.
      
      Encoding OSD ops and adding data items to the request message were
      mixed together in osd_req_encode_op().  While we want to re-encode OSD
      ops, we don't want to add duplicate data items to the message when
      resending, so all calls to ceph_osdc_msg_data_add() are factored out
      into a new setup_request_data().
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: introduce ceph_osd_request_target, calc_target() · 63244fa1
      Ilya Dryomov committed
      Introduce ceph_osd_request_target, containing all mapping-related
      fields of ceph_osd_request and calc_target() for calculating mappings
      and populating it.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: variable-sized ceph_object_id · d30291b9
      Ilya Dryomov committed
      Currently ceph_object_id can hold object names of up to 100
      (CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
      except one - long rbd image names:
      
      - a format 1 header is named "<imgname>.rbd"
      - an object that points to a format 2 header is named "rbd_id.<imgname>"
      
      We operate on these potentially long-named objects during rbd map, and,
      for format 1 images, during header refresh.  (A format 2 header name is
      a small system-generated string.)
      
      Lift this 100 character limit by making ceph_object_id be able to point
      to an externally-allocated string.  Apart from being able to work with
      almost arbitrarily-long named objects, this allows us to reduce the
      size of ceph_object_id from >100 bytes to 64 bytes.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
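The idea of an object id that can point to an externally-allocated string can be sketched in userspace C as follows. This is a minimal illustration, not the kernel code: the struct layout, the `OID_INLINE_LEN` value and the `oid_*` helper names are assumptions chosen for the example. Short names live in an inline buffer; longer ones are stored in a heap allocation that `name` points to.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical inline capacity (including the trailing NUL). */
#define OID_INLINE_LEN 52

struct object_id {
    char *name;                       /* points at inline_name or a heap buffer */
    char inline_name[OID_INLINE_LEN]; /* storage for short names */
    int name_len;                     /* length of name, excluding the NUL */
};

/* Put the id into the empty state, using the inline buffer. */
static void oid_init(struct object_id *oid)
{
    oid->name = oid->inline_name;
    oid->inline_name[0] = '\0';
    oid->name_len = 0;
}

static int oid_is_inline(const struct object_id *oid)
{
    return oid->name == oid->inline_name;
}

/* Release any external string and return to the empty state. */
static void oid_destroy(struct object_id *oid)
{
    if (!oid_is_inline(oid))
        free(oid->name);
    oid_init(oid);
}

/*
 * Set the name of an initialized id. 'name' must not alias oid->name.
 * Returns 0 on success, -1 if the external buffer cannot be allocated
 * (in which case the id is left unchanged).
 */
static int oid_set(struct object_id *oid, const char *name)
{
    size_t len = strlen(name);
    char *buf = NULL;

    if (len >= OID_INLINE_LEN) {
        buf = malloc(len + 1);
        if (!buf)
            return -1;
    }

    oid_destroy(oid);
    if (buf)
        oid->name = buf;
    memcpy(oid->name, name, len + 1);
    oid->name_len = (int)len;
    return 0;
}
```

With this layout a short name such as "foo.rbd" costs no extra allocation, while a long image name only pays for the bytes it needs.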
    • libceph: move message allocation out of ceph_osdc_alloc_request() · 13d1ad16
      Ilya Dryomov committed
      The size of ->r_request and ->r_reply messages depends on the size of
      the object name (ceph_object_id), while the size of ceph_osd_request is
      fixed.  Move message allocation into a separate function that would
      have to be called after ceph_object_id and ceph_object_locator (which
      is also going to become variable in size with RADOS namespaces) have
      been filled in:
      
          req = ceph_osdc_alloc_request(...);
          <fill in req->r_base_oid>
          <fill in req->r_base_oloc>
          ceph_osdc_alloc_messages(req);
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
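The two-step allocation order described in this commit can be illustrated with a small userspace sketch. Everything here is hypothetical (the `struct request` layout, `MSG_FIXED_OVERHEAD`, and the function names only mirror the shape of the calls shown above); the point is that the request itself has a fixed size, while the message buffer can only be sized once the object name is known.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed part of an encoded message, in bytes. */
#define MSG_FIXED_OVERHEAD 64

struct request {
    char base_oid[256];     /* object name, filled in by the caller */
    void *request_msg;      /* allocated later, sized from base_oid */
    size_t request_msg_len;
};

/* Step 1: allocate the fixed-size request; no message yet. */
static struct request *alloc_request(void)
{
    return calloc(1, sizeof(struct request));
}

/* Step 2: allocate the message once the object name has been filled in. */
static int alloc_messages(struct request *req)
{
    size_t len = MSG_FIXED_OVERHEAD + strlen(req->base_oid);

    req->request_msg = calloc(1, len);
    if (!req->request_msg)
        return -1;
    req->request_msg_len = len;
    return 0;
}

static void free_request(struct request *req)
{
    if (!req)
        return;
    free(req->request_msg);
    free(req);
}
```

A caller would run `alloc_request()`, copy the name into `base_oid`, and only then call `alloc_messages()`, matching the ordering in the commit message.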
  2. 05 April, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
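Why the shift rewrites in this patch are safe can be seen in a small standalone sketch. The `DEMO_` macros below are stand-ins invented for this example (assuming 4 KiB pages), not the kernel definitions: as long as the page-cache shift is defined as the page shift, shifting by their difference is a shift by zero, so dropping it changes nothing.

```c
/* Stand-in definitions for illustration only; a 4 KiB page is assumed. */
#define DEMO_PAGE_SHIFT        12
#define DEMO_PAGE_SIZE         (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_CACHE_SHIFT  DEMO_PAGE_SHIFT   /* the old alias */

/* Old style: convert a page-cache index to a page index. */
static unsigned long old_index(unsigned long index)
{
    return index << (DEMO_PAGE_CACHE_SHIFT - DEMO_PAGE_SHIFT);
}

/* New style: the shift by zero is simply removed. */
static unsigned long new_index(unsigned long index)
{
    return index;
}

/* Byte offset to page index, written with the plain page shift. */
static unsigned long offset_to_index(unsigned long long pos)
{
    return (unsigned long)(pos >> DEMO_PAGE_SHIFT);
}
```

The same reasoning covers the `>>` case and the straight renames of the SIZE, MASK and ALIGN macros.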
  3. 26 March, 2016 4 commits
    • ceph: use kmem_cache_zalloc · 99ec2697
      Geliang Tang committed
      Use kmem_cache_zalloc() instead of kmem_cache_alloc() with flag __GFP_ZERO.
      Signed-off-by: Geliang Tang <geliangtang@163.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • ceph: fix security xattr deadlock · 315f2408
      Yan, Zheng committed
      When security is enabled, security module can call filesystem's
      getxattr/setxattr callbacks during d_instantiate(). For cephfs,
      d_instantiate() is usually called by MDS' dispatch thread, while
      handling MDS reply. If the MDS reply does not include xattrs and
      corresponding caps, getxattr/setxattr need to send a new request
      to MDS and wait for the reply. This makes MDS' dispatch thread sleep,
      so nobody handles later MDS replies.
      
      The fix is to make sure lookup/atomic_open replies include xattrs and
      corresponding caps, so getxattr can be served from cached xattrs.
      This requires some modification to both MDS and request message.
      (Client tells MDS what caps it wants; MDS encodes proper caps in
      the reply)
      
      Smack security module may call setxattr during d_instantiate().
      Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
      to us. So just make setxattr return error when called by MDS'
      dispatch thread.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • ceph: replace CURRENT_TIME by current_fs_time() · 8bbd4714
      Deepa Dinamani committed
      CURRENT_TIME macro is not appropriate for filesystems as it
      doesn't use the right granularity for filesystem timestamps.
      Use current_fs_time() instead.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • ceph: remove useless BUG_ON · a587d71b
      Yan, Zheng committed
      ceph_osdc_start_request() never returns -EOLDSNAP.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
  4. 05 February, 2016 2 commits
  5. 23 January, 2016 1 commit
    • wrappers for ->i_mutex access · 5955102c
      Al Viro committed
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. 22 January, 2016 3 commits
    • ceph: use i_size_{read,write} to get/set i_size · 99c88e69
      Yan, Zheng committed
      Cap message from MDS can update i_size. In that case, we don't
      hold i_mutex. So it's unsafe to directly access inode->i_size
      while holding i_mutex.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • ceph: re-send AIO write request when getting -EOLDSNAP error · 5be0389d
      Yan, Zheng committed
      When receiving -EOLDSNAP from OSD, we need to re-send corresponding
      write request. Due to a locking issue, we can't send a new request inside
      another OSD request's complete callback. So we use a worker to re-send
      the request for AIO write.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • ceph: Asynchronous IO support · c8fe9b17
      Yan, Zheng committed
      The basic idea of AIO support is simple, just call kiocb::ki_complete()
      in OSD request's complete callback. But there are several special cases.
      
      When IO spans multiple objects, we need to wait until all OSD requests
      are complete, then call kiocb::ki_complete(). Error handling in this case
      is tricky too. For simplicity, AIO that both spans multiple objects and
      extends i_size is not allowed.
      
      Another special case is checking EOF for reads (another client can write to
      the file and extend i_size concurrently). For simplicity, the direct-IO/AIO
      code path does not do the check and falls back to a normal sync read instead.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
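The "wait until all OSD requests are complete, then call ki_complete()" rule can be sketched with a pending-request counter. This is a userspace illustration under assumptions, not the kernel implementation: the `struct aio_request` fields, the `aio_*` function names and the `demo_*` recording callback are all invented for the example. Each sub-request completion records its result and decrements the counter; only the completion that brings the counter to zero invokes the final callback, reporting the first error if any sub-request failed.

```c
#include <stdatomic.h>

struct aio_request {
    atomic_int pending_reqs;  /* sub-requests still in flight */
    atomic_int error;         /* first error reported, 0 if none */
    atomic_long total_len;    /* bytes moved by successful sub-requests */
    void (*ki_complete)(struct aio_request *aio, long result);
};

static void aio_init(struct aio_request *aio, int nr_reqs,
                     void (*ki_complete)(struct aio_request *, long))
{
    atomic_init(&aio->pending_reqs, nr_reqs);
    atomic_init(&aio->error, 0);
    atomic_init(&aio->total_len, 0);
    aio->ki_complete = ki_complete;
}

/* Called once per sub-request; rc is a byte count or a negative error. */
static void aio_sub_request_done(struct aio_request *aio, long rc)
{
    if (rc < 0) {
        int expected = 0;
        /* Keep only the first error. */
        atomic_compare_exchange_strong(&aio->error, &expected, (int)rc);
    } else {
        atomic_fetch_add(&aio->total_len, rc);
    }

    /* Only the last sub-request to finish completes the whole AIO. */
    if (atomic_fetch_sub(&aio->pending_reqs, 1) != 1)
        return;

    int err = atomic_load(&aio->error);
    aio->ki_complete(aio, err ? (long)err : atomic_load(&aio->total_len));
}

/* Demo callback that records how it was invoked. */
static int demo_completions;
static long demo_last_result;

static void demo_record(struct aio_request *aio, long result)
{
    (void)aio;
    demo_completions++;
    demo_last_result = result;
}
```

For an IO split across three objects, `aio_init(&aio, 3, demo_record)` followed by three `aio_sub_request_done()` calls triggers `demo_record` exactly once, after the third call.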
  7. 03 November, 2015 1 commit
  8. 09 September, 2015 3 commits
  9. 25 June, 2015 5 commits
    • ceph: rework dcache readdir · fdd4e158
      Yan, Zheng committed
      Previously our dcache readdir code relied on child dentries in the
      directory dentry's d_subdir list being sorted by dentry's offset in
      descending order. When adding dentries to the dcache, if a dentry
      already exists, our readdir code moves it to the head of the directory
      dentry's d_subdir list. This design relies on dcache internals.
      Al Viro suggests using ncpfs's approach: keeping an array of pointers
      to dentries in the page cache of the directory inode. The validity of those
      pointers is represented by the directory inode's complete and ordered
      flags. When a dentry gets pruned, we clear directory inode's complete
      flag in the d_prune() callback. Before moving a dentry to other
      directory, we clear the ordered flag for both old and new directory.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL · 687265e5
      Yan, Zheng committed
      GFP_NOFS memory allocation is required for page writeback path.
      But there is no need to use GFP_NOFS in syscall path and readpage
      path.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • f66fd9f0
    • ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR reference · 5dda377c
      Yan, Zheng committed
      In most cases that snap context is needed, we are holding
      reference of CEPH_CAP_FILE_WR. So we can set ceph inode's
      i_head_snapc when getting the CEPH_CAP_FILE_WR reference,
      and make codes get snap context from i_head_snapc. This makes
      the code simpler.
      
      Another benefit of this change is that we can handle snap
      notification more elegantly. Especially when snap context
      is updated while someone else is doing a write. The old queue
      cap_snap code may set cap_snap's context to either the old
      context or the new snap context, depending on whether i_head_snapc
      is set. The new queue cap_snap code always sets cap_snap's
      context to the old snap context.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • libceph: allow setting osd_req_op's flags · 144cba14
      Yan, Zheng committed
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
  10. 24 June, 2015 1 commit
  11. 16 April, 2015 1 commit
  12. 12 April, 2015 4 commits
  13. 26 March, 2015 1 commit
  14. 13 March, 2015 1 commit
  15. 23 February, 2015 1 commit
    • VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) · e36cb0b8
      David Howells committed
      Convert the following where appropriate:
      
       (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).
      
       (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).
      
       (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
           complicated than it appears as some calls should be converted to
           d_can_lookup() instead.  The difference is whether the directory in
           question is a real dir with a ->lookup op or whether it's a fake dir with
           a ->d_automount op.
      
      In some circumstances, we can subsume checks for dentry->d_inode not being
      NULL into this, provided the code isn't in a filesystem that expects
      d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
      use d_inode() rather than d_backing_inode() to get the inode pointer).
      
      Note that the dentry type field may be set to something other than
      DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
      manages the fall-through from a negative dentry to a lower layer.  In such a
      case, the dentry type of the negative union dentry is set to the same as the
      type of the lower dentry.
      
      However, if you know d_inode is not NULL at the call site, then you can use
      the d_is_xxx() functions even in a filesystem.
      
      There is one further complication: a 0,0 chardev dentry may be labelled
      DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
      intended for special directory entry types that don't have attached inodes.
      
      The following perl+coccinelle script was used:
      
      use strict;
      
      my @callers;
      open(my $fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
          die "Can't grep for S_ISDIR and co. callers";
      @callers = <$fd>;
      close($fd);
      unless (@callers) {
          print "No matches\n";
          exit(0);
      }
      
      my @cocci = (
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISLNK(E->d_inode->i_mode)',
          '+ d_is_symlink(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISDIR(E->d_inode->i_mode)',
          '+ d_is_dir(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISREG(E->d_inode->i_mode)',
          '+ d_is_reg(E)' );
      
      my $coccifile = "tmp.sp.cocci";
      open($fd, ">$coccifile") || die $coccifile;
      print($fd "$_\n") || die $coccifile foreach (@cocci);
      close($fd);
      
      foreach my $file (@callers) {
          chomp $file;
          print "Processing ", $file, "\n";
          system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
      	die "spatch failed";
      }
      
      [AV: overlayfs parts skipped]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  16. 19 February, 2015 3 commits
    • ceph: fix atomic_open snapdir · bf91c315
      Yan, Zheng committed
      ceph_handle_snapdir() checks ceph_mdsc_do_request()'s return value
      and creates snapdir inode if it's -ENOENT
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • ceph: fix reading inline data when i_size > PAGE_SIZE · fcc02d2a
      Yan, Zheng committed
      When an inode has inline data but its size > PAGE_SIZE (it was truncated
      to a larger size), the previous direct read code returned -EIO. This patch adds
      code to return zeros for data whose offset > PAGE_SIZE.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • ceph: properly zero data pages for file holes. · 1487a688
      Yan, Zheng committed
      A bug is found in striped_read() of fs/ceph/file.c. striped_read() calls
      ceph_zero_page_vector_range().  The first argument, page_align + read + ret,
      passed to ceph_zero_page_vector_range() is wrong.
      
      When a file has holes, this wrong parameter may cause memory corruption
      either in kernel space or user space. Kernel space memory may be corrupted in
      the case of non direct IO; user space memory may be corrupted in the case of
      direct IO. In the latter case, the application doing direct IO may crash due
      to memory corruption, as we have experienced.
      
      The correct value should be initial_align + read + ret, where initial_align =
      o_direct ? buf_align : io_align.  Compared with page_align, the current page
      offset, initial_align is the initial page offset, which should be used to
      calculate the page and offset in ceph_zero_page_vector_range().
      Reported-by: caifeng zhu <zhucaifeng@unissoft-nj.com>
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
  17. 21 January, 2015 1 commit
  18. 18 December, 2014 2 commits