1. 28 Apr 2021 (3 commits)
  2. 12 Oct 2020 (4 commits)
  3. 25 Aug 2020 (1 commit)
  4. 24 Aug 2020 (2 commits)
    • ceph: fix inode number handling on arches with 32-bit ino_t · ebce3eb2
      By Jeff Layton
      Tuan and Ulrich mentioned that they were hitting a problem on s390x,
      which has a 32-bit ino_t value, even though it's a 64-bit arch (for
      historical reasons).
      
      I think the current handling of inode numbers in the ceph driver is
      wrong. It tries to use 32-bit inode numbers on 32-bit arches, but
      that isn't actually necessary: 32-bit arches can deal with 64-bit
      inode numbers just fine when userland code is compiled with LFS
      support (the common case these days).
      
      What we really want to do is just use 64-bit numbers everywhere, unless
      someone has mounted with the ino32 mount option. In that case, we want
      to ensure that we hash the inode number down to something that will fit
      in 32 bits before presenting the value to userland.
      
      Add new helper functions that do this, and only do the conversion before
      presenting these values to userland in getattr and readdir.
      
      The inode table hash value is changed to just cast the inode number
      to unsigned long, as the low-order bits are the most likely to vary
      anyway.
      
      While it's not strictly required, we do want to put something in
      inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
      the size of the ino_t type.
      
      NOTE: This is a user-visible change on 32-bit arches:
      
      1/ inode numbers will be seen to have changed between kernel versions.
         32-bit arches will see large inode numbers now instead of the hashed
         ones they saw before.
      
      2/ any really old software not built with LFS support may start failing
         stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
         can do about these, but hopefully the intersection of people running
         such code on ceph will be very small.
      
      The workaround for both problems is to mount with "-o ino32".
      
      [ idryomov: changelog tweak ]
      
      URL: https://tracker.ceph.com/issues/46828
      Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
      Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      ebce3eb2
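
      A minimal sketch of the ino32 conversion described above. The helper
      name and the use of hash_64() are assumptions about how the 64-bit
      value gets folded; this is not the exact hunk from the commit:

          #include <linux/hash.h>
          #include <linux/types.h>

          /* Squash a 64-bit ceph inode number into a non-zero 32-bit
           * value for userland when mounted with -o ino32. */
          static inline u32 ceph_ino_to_ino32(u64 vino)
          {
                  u32 ino = hash_64(vino, 32);

                  if (!ino)
                          ino = 1;        /* never hand out ino 0 */
                  return ino;
          }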
    • treewide: Use fallthrough pseudo-keyword · df561f66
      By Gustavo A. R. Silva
      Replace the existing /* fall through */ comments and their variants
      with the new pseudo-keyword macro fallthrough[1]. Also, remove
      fall-through markings that are no longer necessary.
      
      [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      df561f66
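
      For illustration, the mechanical shape of the change (a sketch with
      hypothetical STATE_* values and helpers, not a hunk from this commit;
      the fallthrough macro comes from linux/compiler_attributes.h):

          switch (state) {
          case STATE_PREPARE:
                  setup();
                  fallthrough;    /* replaces the old "fall through" comment */
          case STATE_RUN:
                  run();
                  break;
          default:
                  break;
          }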
  5. 03 Aug 2020 (1 commit)
  6. 01 Jun 2020 (1 commit)
  7. 14 Apr 2020 (1 commit)
  8. 30 Mar 2020 (7 commits)
    • ceph: simplify calling of ceph_get_fmode() · 135e671e
      By Yan, Zheng
      Originally, ceph_get_fmode() for open files was called by the thread
      that handles the request reply. There is a small window between
      updating caps and waking the request initiator, and we need to
      prevent ceph_check_caps() from releasing wanted caps in that window.
      
      Previous patches made fill_inode() call __ceph_touch_fmode() for open
      file requests. This prevents ceph_check_caps() from releasing wanted
      caps for 'caps_wanted_delay_min' seconds, enough time for the request
      initiator to get woken up and call ceph_get_fmode().
      
      This allows us to now call ceph_get_fmode() in ceph_open() instead.
      Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      135e671e
    • ceph: remove delay check logic from ceph_check_caps() · a0d93e32
      By Yan, Zheng
      __ceph_caps_file_wanted() already checks 'caps_wanted_delay_min' and
      'caps_wanted_delay_max'. There is no need to duplicate that logic in
      ceph_check_caps() and __send_cap().
      Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      a0d93e32
    • ceph: consider inode's last read/write when calculating wanted caps · 719a2514
      By Yan, Zheng
      Add i_last_rd and i_last_wr to ceph_inode_info. These fields are
      used to track the last time the client acquired read/write caps for
      the inode.
      
      If there is no read/write on an inode for 'caps_wanted_delay_max'
      seconds, __ceph_caps_file_wanted() does not request caps for
      read/write even if there are open files.
      
      Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted()
      calculates a dir's wanted caps according to the last dir read and
      modification times. If there was a recent dir read, the dir inode
      wants CEPH_CAP_ANY_SHARED caps; if there was a recent dir
      modification, it also wants CEPH_CAP_FILE_EXCL.

      Readdir is a special case: the dir inode wants CEPH_CAP_FILE_EXCL
      after readdir because, with that cap held, modifications do not need
      to release CEPH_CAP_FILE_SHARED or invalidate all the dentry leases
      issued by the readdir.
      Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      719a2514
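
      A simplified sketch of the idle check described above, assuming the
      delay value has already been fetched from the mount options; the
      i_last_rd/i_last_wr fields are the ones this patch adds:

          /* Only want read/write caps if the inode saw a read or write
           * within the last delay_max_secs seconds. */
          static int caps_file_wanted_sketch(struct ceph_inode_info *ci,
                                             unsigned long delay_max_secs)
          {
                  unsigned long cutoff = jiffies - delay_max_secs * HZ;
                  int want = 0;

                  if (time_after(ci->i_last_rd, cutoff))
                          want |= CEPH_CAP_FILE_RD | CEPH_CAP_FILE_CACHE;
                  if (time_after(ci->i_last_wr, cutoff))
                          want |= CEPH_CAP_FILE_WR | CEPH_CAP_FILE_BUFFER;
                  return want;
          }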
    • ceph: update dentry lease for async create · 3313f66a
      By Yan, Zheng
      Otherwise ceph_d_delete() may return 1 for the dentry, which makes
      dput() prune the dentry and clear the parent dir's complete flag.
      Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      3313f66a
    • ceph: attempt to do async create when possible · 9a8d03ca
      By Jeff Layton
      With the Octopus release, the MDS will hand out directory create caps.
      
      If we have Fxc caps on the directory, and complete directory information
      or a known negative dentry, then we can return without waiting on the
      reply, allowing the open() call to return very quickly to userland.
      
      We use the normal ceph_fill_inode() routine to fill in the inode, so we
      have to gin up some reply inode information with what we'd expect the
      newly-created inode to have. The client assumes that it has a full set
      of caps on the new inode, and that the MDS will revoke them when there
      is conflicting access.
      
      This functionality is gated on the wsync/nowsync mount options.
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      9a8d03ca
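
      Roughly, the gate reads like this (a sketch only: the real checks in
      the driver are more involved, and the negative-dentry test here is
      schematic):

          /* Async create is only attempted when we hold the needed caps
           * on the parent dir and already know the name doesn't exist. */
          static bool can_async_create_sketch(struct ceph_inode_info *dci,
                                              struct dentry *dentry)
          {
                  /* Fx plus the new Dc (dir create) cap */
                  if (!ceph_caps_issued_mask(dci, CEPH_CAP_FILE_EXCL |
                                             CEPH_CAP_DIR_CREATE, 1))
                          return false;
                  return __ceph_dir_is_complete(dci) ||
                         d_really_is_negative(dentry);
          }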
    • ceph: cache layout in parent dir on first sync create · 785892fe
      By Jeff Layton
      If a create is done, then typically we'll end up writing to the file
      soon afterward. We don't want to wait for the reply before doing that
      when doing an async create, so that means we need the layout for the
      new file before we've gotten the response from the MDS.
      
      All files created in a directory will initially inherit the same layout,
      so copy off the requisite info from the first synchronous create in the
      directory, and save it in a new i_cached_layout field. Zero out the
      layout when we lose Dc caps in the dir.
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      785892fe
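
      The caching step might be sketched like this; i_cached_layout is the
      field named above, and ceph_file_layout's pool namespace refcounting
      is deliberately glossed over:

          /* On the first synchronous create, remember the child's layout
           * in the parent for use by later async creates. */
          spin_lock(&dci->i_ceph_lock);
          if (__ceph_caps_issued_mask(dci, CEPH_CAP_DIR_CREATE, 0))
                  dci->i_cached_layout = ci->i_layout;    /* struct copy */
          spin_unlock(&dci->i_ceph_lock);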
    • ceph: re-org copy_file_range and fix some error paths · 1b0c3b9f
      By Luis Henriques
      This patch re-organizes copy_file_range, trying to fix a few issues in the
      error handling.  Here's the summary:
      
      - Abort copy if initial do_splice_direct() returns fewer bytes than
        requested.
      
      - Move the 'size' initialization (with i_size_read()) further down in the
        code, after the initial call to do_splice_direct().  This avoids issues
        with a possibly stale value if a manual copy is done.
      
      - Move the object copy loop into a separate function.  This makes it
        easier to handle errors (e.g., dirtying caps and updating the MDS
        metadata if only some objects have been copied before an error
        occurred).

      - Added calls to ceph_oloc_destroy() to avoid leaking memory with
        src_oloc and dst_oloc.
      
      - After the object copy loop, the new file size to be reported to the MDS
        (if there's file size change) is now the actual file size, and not the
        size after an eventual extra manual copy.
      
      - Added a few dout() calls to show the number of bytes copied in the
        two manual copies and in the object copy loop.
      Signed-off-by: Luis Henriques <lhenriques@suse.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      1b0c3b9f
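
      For the first bullet, the shape of the check is roughly as follows
      (a sketch around a real do_splice_direct() call; the variable names
      are placeholders):

          /* If the initial splice copies fewer bytes than requested,
           * report what was copied (or the error) instead of pressing on
           * with the object-copy loop. */
          ret = do_splice_direct(src_file, &src_off, dst_file, &dst_off,
                                 src_objoff, flags);
          if (ret <= 0)
                  goto out;
          if (ret < src_objoff)
                  goto out;       /* short copy: return the partial count */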
  9. 23 Mar 2020 (1 commit)
    • ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULL · 76142097
      By Ilya Dryomov
      CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult
      per-pool flags as well.  Unfortunately the backwards compatibility here
      is lacking:
      
      - the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but
        was guarded by require_osd_release >= RELEASE_LUMINOUS
      - it was subsequently backported to luminous in v12.2.2, but that makes
        no difference to clients that only check OSDMAP_FULL/NEARFULL because
        require_osd_release is not client-facing -- it is for OSDs
      
      Since all kernels are affected, the best we can do here is just start
      checking both map flags and pool flags and send that to stable.
      
      These checks are best effort, so take osdc->lock and look up pool flags
      just once.  Remove the FIXME, since filesystem quotas are checked above
      and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches
      its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set.
      
      Cc: stable@vger.kernel.org
      Reported-by: Yanhu Cao <gmayyyha@gmail.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Acked-by: Sage Weil <sage@redhat.com>
      76142097
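
      The combined check might look like this (a sketch; the exact shape of
      the ceph_pg_pool_flags() per-pool lookup is an assumption, and per
      the note above osdc->lock is taken just once for the best-effort
      pool flag read):

          u64 pool_flags;

          down_read(&osdc->lock);
          pool_flags = ceph_pg_pool_flags(osdc->osdmap, pool_id);
          up_read(&osdc->lock);

          /* full if either the cluster-wide flag or the pool flag is set */
          if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
              (pool_flags & CEPH_POOL_FLAG_FULL))
                  return -ENOSPC;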
  10. 12 Feb 2020 (1 commit)
  11. 27 Jan 2020 (1 commit)
  12. 15 Nov 2019 (2 commits)
    • ceph: increment/decrement dio counter on async requests · 6a81749e
      By Jeff Layton
      Ceph can in some cases issue an async DIO request, in which case we can
      end up calling ceph_end_io_direct before the I/O is actually complete.
      That may allow buffered operations to proceed while DIO requests are
      still in flight.
      
      Fix this by incrementing the i_dio_count when issuing an async DIO
      request, and decrement it when tearing down the aio_req.
      
      Fixes: 321fe13c ("ceph: add buffered/direct exclusionary locking for reads and writes")
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      6a81749e
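
      In outline (a sketch; inode_dio_begin()/inode_dio_end() are the stock
      VFS helpers that maintain i_dio_count):

          /* When issuing the async DIO request: */
          if (aio_req)
                  inode_dio_begin(inode);

          /* ...and when the last OSD reply tears down the aio_req: */
          if (aio_req)
                  inode_dio_end(inode);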
    • ceph: take the inode lock before acquiring cap refs · a81bc310
      By Jeff Layton
      Most of the time, we (or the vfs layer) takes the inode_lock and then
      acquires caps, but ceph_read_iter does the opposite, and that can lead
      to a deadlock.
      
      When there are multiple clients treading over the same data, we can end
      up in a situation where a reader takes caps and then tries to acquire
      the inode_lock. Another task holds the inode_lock and issues a request
      to the MDS which needs to revoke the caps, but that can't happen until
      the inode_lock is unwedged.
      
      Fix this by having ceph_read_iter take the inode_lock earlier, before
      attempting to acquire caps.
      
      Fixes: 321fe13c ("ceph: add buffered/direct exclusionary locking for reads and writes")
      Link: https://tracker.ceph.com/issues/36348
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      a81bc310
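
      Schematically, the fix reorders the two steps (a sketch; the
      ceph_get_caps() call and the error handling are abbreviated):

          /* Take the inode lock first, matching the order used by the
           * rest of the code, then acquire cap references. */
          inode_lock_shared(inode);
          ret = ceph_get_caps(filp, CEPH_CAP_FILE_RD, want, -1, &got);
          if (ret < 0) {
                  inode_unlock_shared(inode);
                  return ret;
          }
          /* ...read, release caps, then inode_unlock_shared(inode)... */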
  13. 05 Nov 2019 (2 commits)
    • ceph: don't allow copy_file_range when stripe_count != 1 · a3a08193
      By Luis Henriques
      copy_file_range tries to use the OSD 'copy-from' operation, which simply
      performs a full object copy.  Unfortunately, the implementation of this
      system call assumes that stripe_count is always set to 1 and doesn't take
      into account that the data may be striped across an object set.  If the
      file layout has stripe_count different from 1, then the destination file
      data will be corrupted.
      
      For example:
      
      Consider a 8 MiB file with 4 MiB object size, stripe_count of 2 and
      stripe_size of 2 MiB; the first half of the file will be filled with 'A's
      and the second half will be filled with 'B's:
      
                     0      4M     8M       Obj1     Obj2
                     +------+------+       +----+   +----+
              file:  | AAAA | BBBB |       | AA |   | AA |
                     +------+------+       |----|   |----|
                                           | BB |   | BB |
                                           +----+   +----+
      
      If we copy_file_range this file into a new file (which needs to have the
      same file layout!), then it will start by copying the object starting at
      file offset 0 (Obj1).  And then it will copy the object starting at file
      offset 4M -- which is Obj1 again.
      
      Unfortunately, the solution for this is to not allow remote object
      copies when the file layout stripe_count is not 1, and to simply fall
      back to the default (VFS) copy_file_range implementation.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Luis Henriques <lhenriques@suse.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      a3a08193
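
      The guard itself is short (sketch; returning -EOPNOTSUPP makes the
      VFS fall back to its generic copy_file_range implementation):

          /* Remote object copies only make sense for unstriped layouts,
           * so bail out and let the VFS do a regular copy. */
          if (ceph_inode(src_inode)->i_layout.stripe_count != 1 ||
              ceph_inode(dst_inode)->i_layout.stripe_count != 1)
                  return -EOPNOTSUPP;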
    • ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open · 5bb5e6ee
      By Jeff Layton
      If ceph_atomic_open is handed a !d_in_lookup dentry, then that means
      that it already passed d_revalidate so we *know* that it's negative (or
      at least was very recently). Just return -ENOENT in that case.
      
      This also addresses a subtle bug in dentry handling. Non-O_CREAT opens
      call atomic_open with the parent's i_rwsem shared, but calling
      d_splice_alias on a hashed dentry requires the exclusive lock.
      
      If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT
      open, and another client were to race in and create the file before we
      issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on
      the dentry with the new inode with insufficient locks.
      
      Cc: stable@vger.kernel.org
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      5bb5e6ee
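
      The early return reads roughly as (sketch):

          /* A hashed (!d_in_lookup) dentry already passed d_revalidate,
           * so we know it is (or very recently was) negative. */
          if (!d_in_lookup(dentry) && !(flags & O_CREAT))
                  return -ENOENT;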
  14. 23 Oct 2019 (1 commit)
    • ceph: fix compat_ioctl for ceph_dir_operations · 18bd6caa
      By Arnd Bergmann
      The ceph_ioctl function is used for both files and directories, but
      only files supported it in 32-bit compat mode.
      
      On the s390 architecture, there is also a problem with invalid 31-bit
      pointers that need to be passed through compat_ptr().
      
      Use the new compat_ptr_ioctl() to address both issues.
      
      Note: When backporting this patch to stable kernels, "compat_ioctl:
      add compat_ptr_ioctl()" is needed as well.
      Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      18bd6caa
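
      The fix amounts to wiring up the helper in the file_operations table
      (sketch; only the relevant fields are shown):

          /* compat_ptr_ioctl() passes the argument through compat_ptr()
           * (fixing 31-bit s390 pointers) and calls ->unlocked_ioctl. */
          const struct file_operations ceph_dir_fops = {
                  /* ... */
                  .unlocked_ioctl = ceph_ioctl,
                  .compat_ioctl   = compat_ptr_ioctl,
          };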
  15. 16 Sep 2019 (7 commits)
  16. 08 Jul 2019 (5 commits)