Commits · fcc95f06403c956e3f50ca4a82db12b66a3078e0 · openeuler / Kernel

30 Mar, 2020 12 commits

ceph: consider inode's last read/write when calculating wanted caps · 719a2514

Committed by Yan, Zheng on Mar 05, 2020

Add i_last_rd and i_last_wr to ceph_inode_info. These fields are
used to track the last time the client acquired read/write caps for
the inode.

If there has been no read/write on an inode for 'caps_wanted_delay_max'
seconds, __ceph_caps_file_wanted() does not request caps for read/write
even if there are open files.

Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted()
calculates dir's wanted caps according to last dir read/modification. If
there was a recent dir read, the dir inode wants CEPH_CAP_ANY_SHARED caps.
If there was a recent dir modification, it also wants CEPH_CAP_FILE_EXCL.

Readdir is a special case. Dir inode wants CEPH_CAP_FILE_EXCL after
readdir, as with that, modifications do not need to release
CEPH_CAP_FILE_SHARED or invalidate all dentry leases issued by readdir.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
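The idle-timeout rule described in this commit can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel code: the struct, the WANT_RD/WANT_WR bit values, and file_wanted() are hypothetical names, and whether the boundary comparison is strict or inclusive is a guess.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real cap bits; the values are illustrative. */
#define WANT_RD 0x1
#define WANT_WR 0x2

struct inode_activity {
    long last_rd;       /* time (seconds) read caps were last acquired */
    long last_wr;       /* time (seconds) write caps were last acquired */
    bool open_for_rd;   /* some file handle is open for reading */
    bool open_for_wr;   /* some file handle is open for writing */
};

/*
 * Return the caps still worth requesting at time `now`.
 * An open file only contributes if the matching activity happened
 * within the last `delay_max` seconds (inclusive bound is an assumption).
 */
static int file_wanted(const struct inode_activity *ia, long now, long delay_max)
{
    int wanted = 0;

    if (ia->open_for_rd && now - ia->last_rd <= delay_max)
        wanted |= WANT_RD;
    if (ia->open_for_wr && now - ia->last_wr <= delay_max)
        wanted |= WANT_WR;
    return wanted;
}
```

With a 60 second limit, a file that was read 20 seconds ago and written 110 seconds ago would only want read caps in this model, even though it is open for both.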

ceph: always renew caps if mds_wanted is insufficient · c0e385b1

Committed by Yan, Zheng on Mar 05, 2020

Original code only renews caps for inodes with CEPH_I_CAP_DROPPED flag,
which indicates that mds has closed the session and caps were dropped.
Remove this flag in preparation for not requesting caps for idle open
files.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: cache layout in parent dir on first sync create · 785892fe

Committed by Jeff Layton on Jan 02, 2020

If a create is done, then typically we'll end up writing to the file
soon afterward. We don't want to wait for the reply before doing that
when doing an async create, so that means we need the layout for the
new file before we've gotten the response from the MDS.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add new MDS req field to hold delegated inode number · 6deb8008

Committed by Jeff Layton on Jan 13, 2020

Add new request field to hold the delegated inode number. Encode that
into the message when it's set.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: decode interval_sets for delegated inos · d4846487

Committed by Jeff Layton on Nov 15, 2019

Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.

Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.

Because the xarray code currently only handles unsigned long indexes,
and those are 32 bits wide on 32-bit arches, we only enable the decoding
when running on a 64-bit arch.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: cap tracking for async directory operations · a25949b9

Committed by Jeff Layton on Feb 18, 2020

Track and correctly handle directory caps for asynchronous operations.
Add aliases for Frc caps that we now designate as Dcu caps (when dealing
with directories).

Unlike file caps, we don't reclaim these when the session goes away, and
instead preemptively release them. In-flight async dirops are instead
handled during reconnect phase. The client needs to re-do a synchronous
operation in order to re-get directory caps.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add infrastructure for waiting for async create to complete · 891f3f5a

Committed by Jeff Layton on Jan 14, 2020

When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.

Expand i_ceph_flags to be an unsigned long, and add a new bit that
MDS requests can wait on. If the bit is set in the inode when sending
caps, then don't send it and just return that it has been delayed.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add flag to designate that a request is asynchronous · 3bb48b41

Committed by Jeff Layton on Dec 02, 2019

...and ensure that such requests are never queued. The MDS needs to
know that a request is asynchronous, so add flags and proper
infrastructure for that.

Also, delegated inode numbers and directory caps are associated with the
session, so ensure that async requests are always transmitted on the
first attempt and are never queued to wait for session reestablishment.

If it does end up looking like we'll need to queue the request, then
have it return -EJUKEBOX so the caller can reattempt with a synchronous
request.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
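The "never queue, fall back to synchronous" behaviour can be modelled with a small userspace sketch. Everything here is hypothetical scaffolding (struct mds_request, submit_request(), create_with_fallback(), the REQ_* results); only the idea of returning -EJUKEBOX to the caller comes from the commit message. SKETCH_EJUKEBOX copies the kernel-internal value 528, which userspace headers do not define.

```c
#include <stdbool.h>

/* Mirrors the kernel-internal errno value; defined locally for the sketch. */
#define SKETCH_EJUKEBOX 528

enum { REQ_SENT = 0, REQ_QUEUED = 1 };

struct mds_request {
    bool async;   /* hypothetical flag marking an asynchronous request */
};

/*
 * Send the request if the session is usable. A synchronous request may be
 * queued to wait for the session; an asynchronous one is refused instead.
 */
static int submit_request(const struct mds_request *req, bool session_open)
{
    if (session_open)
        return REQ_SENT;
    if (req->async)
        return -SKETCH_EJUKEBOX;
    return REQ_QUEUED;
}

/* Caller side: try async first, retry synchronously on -EJUKEBOX. */
static int create_with_fallback(bool session_open)
{
    struct mds_request req = { .async = true };
    int ret = submit_request(&req, session_open);

    if (ret == -SKETCH_EJUKEBOX) {
        req.async = false;
        ret = submit_request(&req, session_open);
    }
    return ret;
}
```

The design point is that the async path never waits for session reestablishment: the caller either gets the request on the wire at once or is told to redo it synchronously.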

ceph: return ETIMEDOUT errno to userland when request timed out · 8ccf7fcc

Committed by Xiubo Li on Feb 23, 2020

req->r_timeout is only used during mounting, so this error will
be more accurate.

URL: https://tracker.ceph.com/issues/44215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: move to a dedicated slabcache for mds requests · 058daab7

Committed by Jeff Layton on Feb 17, 2020

On my machine (x86_64) this struct is 952 bytes, which gets rounded up
to 1024 by kmalloc. Move this to a dedicated slabcache, so we can
allocate them without the extra 72 bytes of overhead per request.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
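The arithmetic behind the 72 bytes can be checked with a short sketch. It assumes power-of-two size classes, which is a simplification (the general-purpose kmalloc caches also include a few non-power-of-two sizes), and both helper names are hypothetical.

```c
#include <stddef.h>

/* Round n up to the next power of two (n must be greater than 0). */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Bytes lost per object when it is served from a power-of-two bucket. */
static size_t bucket_waste(size_t n)
{
    return round_up_pow2(n) - n;
}
```

For a 952-byte object this gives a 1024-byte bucket and 72 wasted bytes, matching the figures in the commit message; a dedicated cache sized to the object avoids that rounding.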

ceph: check inode type for CEPH_CAP_FILE_{CACHE,RD,REXTEND,LAZYIO} · 525d15e8

Committed by Yan, Zheng on May 11, 2019

These bits will have new meaning for directory inodes.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: register MDS request with dir inode from the start · 3db0a2fc

Committed by Jeff Layton on Apr 04, 2019

When the unsafe reply to a request comes in, the request is put on the
r_unsafe_dir inode's list. In future patches, we're going to need to
wait on requests that may not have gotten an unsafe reply yet.

Change __register_request to put the entry on the dir inode's list when
the pointer is set in the request, and don't check the
CEPH_MDS_R_GOT_UNSAFE flag when unregistering it.

The only place that uses this list today is the fsync codepath, and with
the coming changes, we'll want to wait on all operations whether or not
they have gotten an unsafe reply.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

27 Jan, 2020 11 commits

ceph: print r_direct_hash in hex in __choose_mds() dout · 3c802092

Committed by Xiubo Li on Jan 01, 2020

It's hard to read, especially when it is:

  ceph:  __choose_mds 00000000b7bc9c15 is_hash=1 (-271041095) mode 0

At the same time, switch to __func__ to get rid of the checkpatch
warning.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
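To see why hex reads better, the value from the log line above can be reformatted in userspace. This is only an illustration: format_hash() is a hypothetical helper, and the exact format string used by the kernel patch is not shown in the commit message.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 32-bit hash as hex into buf.
 * Returns the number of characters written (excluding the terminator),
 * as reported by snprintf().
 */
static int format_hash(char *buf, size_t len, uint32_t hash)
{
    return snprintf(buf, len, "0x%" PRIx32, hash);
}
```

Passing the signed value from the example, converted to uint32_t, yields 0xefd83db9, which is the same 32 bits that a signed decimal conversion prints as -271041095.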

ceph: allocate the correct amount of extra bytes for the session features · 9ba1e224

Committed by Xiubo Li on Jan 08, 2020

The total bytes may potentially be larger than 8.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: rename get_session and switch to use ceph_get_mds_session · 5b3248c6

Committed by Xiubo Li on Dec 19, 2019

If the session's refcount has reached 0 and the session is being
released, getting the session without checking the refcount may crash
the kernel.

Rename get_session to ceph_get_mds_session and make it global.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add possible_max_rank and make the code more readable · b38c9eb4

Committed by Xiubo Li on Dec 04, 2019

The m_num_mds here is actually the number of MDSs which are in
up:active state, and it duplicates m_num_active_mds, so remove it.

Add possible_max_rank to the mdsmap struct; this is the correct upper
bound on the possible ranks.

Remove the special case for one mds in __mdsmap_get_random_mds(),
because the valid mds rank may not always be 0.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: retry the same mds later after the new session is opened · c4853e97

Committed by Xiubo Li on Dec 09, 2019

If max_mds > 1 and a request is submitted that chooses a random mds
rank, and the related session is not opened yet, the request will wait
until the session has been opened and then be resent.

Every time the request goes through __do_request, it will release the
req->session first and choose a random one again, which may be a
completely different rank than the one it just waited on.

In the worst case, it will open all the mds sessions one by one just
before the request can be successfully sent out.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: check availability of mds cluster on mount after wait timeout · 97820058

Committed by Xiubo Li on Dec 10, 2019

If all the MDS daemons are down for some reason, then the first mount
attempt will fail with EIO after the mount request times out. A mount
attempt will also fail with EIO if all of the MDS's are laggy.

This patch changes the code to return -EHOSTUNREACH in these situations
and adds a pr_info error message to help the admin determine the cause.

URL: https://tracker.ceph.com/issues/4386
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: keep the session state until it is released · 4d681c2f

Committed by Xiubo Li on Dec 05, 2019

If a session reconnect is denied by the MDS, because the client was
blacklisted or for some other reason, kclient will receive a session
close reply, and we will never see the important log:

"ceph:  mds%d reconnect denied"

Instead, we get the confusing log:

"ceph:  handle_session mds0 close 0000000085804730 state ??? seq 0"

Let's keep the session state until its memory is released.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add __send_request helper · 9cf54563

Committed by Xiubo Li on Dec 05, 2019

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: fix possible long time wait during umount · 07edc057

Committed by Xiubo Li on Dec 04, 2019

During umount, if there are no unsafe requests in the mdsc but some
requests are still in flight and have not received a reply yet, and the
remaining requests are all safe ones, then even after all of them have
been unregistered from the mdsc, the umount still has to wait for
mount_timeout seconds.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: only choose one MDS who is in up:active state without laggy · 5d47648f

Committed by Xiubo Li on Nov 26, 2019

Even if an MDS is in up:active state, it may also be laggy. Skip the
laggy MDSs here.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
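The selection rule can be sketched as a filter over a rank table. This is a simplified model, not the kernel's mdsmap code: struct mds_info, mds_usable(), and choose_mds() are hypothetical, and the caller supplies the `pick` value (for example a random number) so that the sketch stays deterministic.

```c
#include <stdbool.h>
#include <stddef.h>

enum mds_state { MDS_DOWN, MDS_UP_ACTIVE };

struct mds_info {
    enum mds_state state;
    bool laggy;
};

/* An MDS is usable only if it is up:active and not flagged laggy. */
static bool mds_usable(const struct mds_info *m)
{
    return m->state == MDS_UP_ACTIVE && !m->laggy;
}

/*
 * Return the rank of the (pick mod usable-count)-th usable MDS,
 * or -1 if no MDS is usable.
 */
static int choose_mds(const struct mds_info *map, size_t n, size_t pick)
{
    size_t usable = 0;

    for (size_t i = 0; i < n; i++)
        if (mds_usable(&map[i]))
            usable++;
    if (usable == 0)
        return -1;

    pick %= usable;
    for (size_t i = 0; i < n; i++)
        if (mds_usable(&map[i]) && pick-- == 0)
            return (int)i;
    return -1;
}
```

Counting usable ranks first keeps the pick uniform over the usable set, rather than retrying whenever a laggy rank happens to be drawn.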

ceph: delete redundant douts in con_get/put() · 8f5ac172

Committed by Chengguang Xu on Jul 12, 2018

We print session's refcount in debug message inside
ceph_put_mds_session() and get_session(), so we don't have to
print it in con_get()/__ceph_lookup_mds_session()/con_put().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

22 Jan, 2020 1 commit

ceph: hold extra reference to r_parent over life of request · 9c1c2b35

Committed by Jeff Layton on Apr 03, 2019

Currently, we just assume that it will stick around by virtue of the
submitter's reference, but later patches will allow the syscall to
return early and we can't rely on that reference at that point.

While I'm not aware of any reports of it, Xiubo pointed out that this
may fix a use-after-free.  If the wait for a reply times out or is
canceled via signal, and then the reply comes in after the syscall
returns, the client can end up trying to access r_parent without a
reference.

Take an extra reference to the inode when setting r_parent and release
it when releasing the request.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
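The ownership rule in the last paragraph of this commit (take a reference when setting r_parent, drop it when the request is released) can be modelled with a plain counter. This is a single-threaded sketch with hypothetical types and helpers; the kernel uses its own inode reference helpers and atomic counts.

```c
#include <stddef.h>

struct inode {
    int refcount;
};

struct request {
    struct inode *parent;   /* pinned for the whole life of the request */
};

static void inode_get(struct inode *i) { i->refcount++; }
static void inode_put(struct inode *i) { i->refcount--; }

/* Setting the parent takes the request's own reference. */
static void request_set_parent(struct request *req, struct inode *dir)
{
    inode_get(dir);
    req->parent = dir;
}

/* Releasing the request drops that reference exactly once. */
static void request_release(struct request *req)
{
    if (req->parent) {
        inode_put(req->parent);
        req->parent = NULL;
    }
}
```

In this model the submitter can drop its reference early (for instance when a wait times out) and the directory stays pinned until request_release() runs, which is the property a late reply depends on.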

10 Dec, 2019 3 commits

ceph: trigger the reclaim work once there has enough pending caps · bba1560b

Committed by Xiubo Li on Nov 26, 2019

The nr in ceph_reclaim_caps_nr() is very likely to be larger than 1,
so we may miss the trigger point and the reclaim work may not be
triggered as expected.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
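One way such a miss can happen is sketched below, assuming the trigger tests whether the pending counter is an exact multiple of a batch size. The commit message does not spell out the check, so CAPS_PER_BATCH and both helpers are hypothetical illustrations of the general pitfall rather than the actual patch.

```c
#include <stdbool.h>

/* Hypothetical batch size after which reclaim work should be kicked. */
#define CAPS_PER_BATCH 64u

/* Exact-multiple test: only fires if the counter lands on a boundary. */
static bool crossed_exact(unsigned pending_after)
{
    return pending_after % CAPS_PER_BATCH == 0;
}

/*
 * Range test: fires if adding `nr` items moved the counter onto or past
 * a batch boundary, even when the boundary itself was stepped over.
 */
static bool crossed_range(unsigned pending_after, unsigned nr)
{
    return pending_after % CAPS_PER_BATCH < nr;
}
```

If the counter goes from 62 to 67 in a single step of 5, the exact test stays silent while the range test reports that the boundary at 64 was crossed.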

ceph: show tasks waiting on caps in debugfs caps file · 3a3430af

Committed by Jeff Layton on Nov 20, 2019

Add some visibility of tasks that are waiting for caps to the "caps"
debugfs file. Display the tgid of the waiting task, inode number, and
the caps the task needs and wants.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: convert int fields in ceph_mount_options to unsigned int · ad8c28a9

Committed by Jeff Layton on Sep 09, 2019

Most of these values should never be negative, so convert them to
unsigned values. Add some sanity checking to the parsed values, and
clean up some unneeded casts.

Note that while caps_max should never be negative, this patch leaves
it signed, since this value ends up later being compared to a signed
counter. Just ensure that userland never passes in a negative value
for caps_max.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

09 Dec, 2019 1 commit

fs: ceph: Delete timespec64_trunc() usage · 668c9a61

Committed by Deepa Dinamani on Dec 02, 2019

Since ceph always uses ns granularity, skip the
truncation which is a no-op.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: jlayton@kernel.org
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
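Why truncation is a no-op at nanosecond granularity can be seen from a small model of granularity-based timestamp truncation. struct ts and ts_trunc() are hypothetical userspace stand-ins written for this illustration, not the kernel helper itself.

```c
struct ts {
    long long sec;
    long nsec;   /* 0 .. 999999999 */
};

/*
 * Truncate the nanosecond part to a multiple of `gran` nanoseconds.
 * gran == 1 leaves the value untouched; gran of one full second
 * clears the nanosecond part; other values round down.
 */
static struct ts ts_trunc(struct ts t, long gran)
{
    if (gran == 1) {
        /* already at nanosecond granularity: nothing to do */
    } else if (gran == 1000000000L) {
        t.nsec = 0;
    } else if (gran > 1 && gran < 1000000000L) {
        t.nsec -= t.nsec % gran;
    }
    return t;
}
```

With gran equal to 1 the function returns its input unchanged, so a filesystem that always works at nanosecond granularity gains nothing from the call.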

25 Nov, 2019 2 commits

ceph: don't leave ino field in ceph_mds_request_head uninitialized · 2def865a

Committed by Jeff Layton on Oct 14, 2019

We currently just pass junk in this field unless we're retransmitting a
create, but in later patches, we'll need a mechanism to pass a delegated
inode number on an initial create request. Prepare for this by ensuring
this field is zeroed out.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: tone down loglevel on ceph_mdsc_build_path warning · f5946bcc

Committed by Jeff Layton on Oct 16, 2019

When this occurs, it usually means that we raced with a rename, and
there is no need to warn in that case.  Only printk if we pass the
rename sequence check but still ended up with pos < 0.

Either way, this doesn't warrant a KERN_ERR message. Change it to
KERN_WARNING.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

15 Oct, 2019 1 commit

ceph: just skip unrecognized info in ceph_reply_info_extra · 1d3f8723

Committed by Jeff Layton on Sep 26, 2019

In the future, we're going to want to extend the ceph_reply_info_extra
for create replies. Currently though, the kernel code doesn't accept an
extra blob that is larger than the expected data.

Change the code to skip over any unrecognized fields at the end of the
extra blob, rather than returning -EIO.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
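The forward-compatible decoding idea can be sketched as follows. The real layout of the extra blob is not described in the commit message, so the single little-endian 64-bit field, decode_extra(), and SKETCH_EIO are all hypothetical; the point is only that a blob longer than expected is tolerated while a blob that is too short is still an error.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EIO 5   /* local stand-in for EIO */

/*
 * Decode one little-endian u64 from the front of the blob.
 * Trailing bytes that this decoder does not understand are skipped
 * and counted in *skipped rather than treated as an error.
 */
static int decode_extra(const uint8_t *p, size_t len,
                        uint64_t *value, size_t *skipped)
{
    uint64_t v = 0;

    if (len < 8)
        return -SKETCH_EIO;   /* too short is still malformed */

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];

    *value = v;
    *skipped = len - 8;       /* unknown tail: ignore it */
    return 0;
}
```

A sender can then append new fields later without breaking older decoders, which simply step over the bytes they do not recognize.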

16 Sep, 2019 5 commits

ceph: reconnect connection if session hang in opening state · 71a228bc

Committed by Erqi Chen on Aug 28, 2019

If a client's mds session is evicted while in CEPH_MDS_SESSION_OPENING
state, the mds won't send a session msg to the client, and delayed_work
skips sessions in CEPH_MDS_SESSION_OPENING state, so the session hangs
forever.

Allow ceph_con_keepalive to reconnect a session in OPENING to avoid
session hang. Also, ensure that we skip sessions in RESTARTING and
REJECTED states since those states can't be resurrected by issuing
a keepalive.

Link: https://tracker.ceph.com/issues/41551
Signed-off-by: Erqi Chen <chenerqi@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: eliminate session->s_trim_caps · 533a2818

Committed by Jeff Layton on Jul 19, 2019

It's only used to keep count of caps being trimmed, but that requires
that we hold the session->s_mutex to prevent multiple trimming
operations from running concurrently.

We can achieve the same effect using an integer on the stack, which
allows us to (eventually) not need the s_mutex.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: auto reconnect after blacklisted · 131d7eb4

Committed by Yan, Zheng on Jul 25, 2019

Make the client use osd replies and session messages to infer whether
it has been blacklisted. The client reconnects to the cluster using a
new entity addr if it is blacklisted. Auto reconnect is limited to once
every 30 minutes.

Auto reconnect is disabled by default. It can be enabled/disabled by the
recover_session=<no|clean> mount option. In 'clean' mode, the client
drops any dirty data/metadata, invalidates page caches and invalidates
all writable file handles. After reconnect, file locks become stale
because the MDS loses track of them. If an inode contains any stale file
locks, read/write on the inode is not allowed until applications release
all stale file locks.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add helper function that forcibly reconnects to ceph cluster. · d468e729

Committed by Yan, Zheng on Jul 25, 2019

It closes mds sessions, drops all caps and invalidates page caches,
then uses a new entity address to reconnect to the cluster.

After reconnect, all dirty data/metadata are dropped, and file locks
get lost silently. Open files continue to work because the client will
try renewing caps on later read/write.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: track and report error of async metadata operation · f4b97866

Committed by Yan, Zheng on Jul 25, 2019

Use errseq_t to track and report errors of async metadata operations,
similar to how kernel handles errors during writeback.

If any dirty caps or any unsafe request gets dropped during session
eviction, record -EIO in the corresponding inode's i_meta_err. The error
will be reported by a subsequent fsync.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

08 Jul, 2019 4 commits

ceph: add change_attr field to ceph_inode_info · a35ead31

Committed by Jeff Layton on Jun 06, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add btime field to ceph_inode_info · 245ce991

Committed by Jeff Layton on May 29, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: remove request from waiting list before unregister · 428138c9

Committed by Yan, Zheng on Jun 14, 2019

Link: https://tracker.ceph.com/issues/40339
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: don't blindly unregister session that is in opening state · 6f0f597b

Committed by Yan, Zheng on Jun 10, 2019

handle_cap_export() may add placeholder caps to a session that is in
opening state. These caps' session pointers become wild after the
session gets unregistered.

The fix is not to unregister a session in opening state during mds
failovers, and just let the client reconnect later when the mds has
recovered.

Link: https://tracker.ceph.com/issues/40190
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
