- 16 September 2019, 5 commits
-
Committed by Erqi Chen
If a client's MDS session is evicted while in CEPH_MDS_SESSION_OPENING state, the MDS won't send a session message to the client, and delayed_work skips sessions in CEPH_MDS_SESSION_OPENING state, so the session hangs forever. Allow ceph_con_keepalive() to reconnect a session in OPENING state to avoid the hang. Also ensure that we skip sessions in RESTARTING and REJECTED states, since those states can't be resurrected by issuing a keepalive.

Link: https://tracker.ceph.com/issues/41551
Signed-off-by: Erqi Chen <chenerqi@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
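A minimal sketch of the resulting delayed_work state handling, assuming the real CEPH_MDS_SESSION_* states and ceph_con_keepalive(); the helper name maybe_send_keepalive() is hypothetical:

  static void maybe_send_keepalive(struct ceph_mds_session *s)
  {
      switch (s->s_state) {
      case CEPH_MDS_SESSION_OPENING:  /* now kept alive as well */
      case CEPH_MDS_SESSION_OPEN:
      case CEPH_MDS_SESSION_HUNG:
          ceph_con_keepalive(&s->s_con);
          break;
      case CEPH_MDS_SESSION_RESTARTING:
      case CEPH_MDS_SESSION_REJECTED:
      default:
          /* a keepalive can't resurrect these sessions */
          break;
      }
  }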
-
Committed by Jeff Layton
It's only used to keep count of caps being trimmed, but that requires holding session->s_mutex to prevent multiple trimming operations from running concurrently. We can achieve the same effect with an integer on the stack, which allows us to (eventually) not need the s_mutex.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
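A sketch of the on-stack counter pattern; the per-cap iteration helper follows the MDS client's iterate-session-caps idea, but names and signatures here are simplified assumptions:

  static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap,
                          void *arg)
  {
      int *remaining = arg;   /* lives on the caller's stack */

      if (*remaining <= 0)
          return 0;
      /* ... decide whether this cap may be dropped ... */
      (*remaining)--;
      return 0;
  }

  static void trim_session_caps(struct ceph_mds_session *session,
                                int max_caps)
  {
      int remaining = session->s_nr_caps - max_caps;

      if (remaining > 0)
          ceph_iterate_session_caps(session, trim_caps_cb, &remaining);
      /* no shared session->s_trim_caps, so no s_mutex needed for the count */
  }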
-
Committed by Yan, Zheng
Make the client use OSD replies and session messages to infer whether it has been blacklisted. A blacklisted client reconnects to the cluster using a new entity addr. Auto reconnect is limited to once every 30 minutes and is disabled by default; it can be enabled/disabled with the recover_session=<no|clean> mount option. In 'clean' mode, the client drops any dirty data/metadata, invalidates page caches and invalidates all writable file handles. After reconnect, file locks become stale because the MDS loses track of them. If an inode contains any stale file locks, reads/writes on the inode are not allowed until applications release all stale file locks.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
It closes MDS sessions, drops all caps and invalidates page caches, then uses a new entity address to reconnect to the cluster. After reconnect, all dirty data/metadata are dropped and file locks are silently lost. Open files continue to work because the client will try renewing caps on a later read/write.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
Use errseq_t to track and report errors from async metadata operations, similar to how the kernel handles errors during writeback. If any dirty caps or any unsafe request gets dropped during session eviction, record -EIO in the corresponding inode's i_meta_err. The error will be reported by a subsequent fsync.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
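A minimal sketch of the errseq_t flow, assuming an i_meta_err field in ceph_inode_info and a cursor sampled at open time; errseq_set() and errseq_check_and_advance() are the stock <linux/errseq.h> helpers:

  /* on session eviction, latch the error on the inode */
  static void mark_meta_error(struct ceph_inode_info *ci)
  {
      errseq_set(&ci->i_meta_err, -EIO);
  }

  /* on fsync, report it once per cursor (e.g. per open file) */
  static int check_meta_error(struct ceph_inode_info *ci,
                              errseq_t *cursor)
  {
      return errseq_check_and_advance(&ci->i_meta_err, cursor);
  }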
-
- 08 July 2019, 7 commits
-
Committed by Jeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Jeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
Link: https://tracker.ceph.com/issues/40339
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
handle_cap_export() may add placeholder caps to a session that is in the opening state. These caps' session pointers become wild after the session gets unregistered. The fix is to not unregister sessions in the opening state during MDS failover; just let the client reconnect later once the MDS has recovered.

Link: https://tracker.ceph.com/issues/40190
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by David Disseldorp
MDS InodeStat v3 wire structures include a trailing snapshot creation time member. Unmarshal it and retain it for a future vxattr.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
- 28 June 2019, 1 commit
-
Committed by Jeff Layton
When ceph_mdsc_build_path is handed a positive dentry, it returns a zero-length path string with the base set to that dentry. This is not what we want: always include at least one path component in the string. ceph_mdsc_build_path has behaved this way for a long time, but it didn't matter until the recent d_name handling rework.

Fixes: 964fff74 ("ceph: use ceph_mdsc_build_path instead of clone_dentry_name")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
- 06 June 2019, 1 commit
-
Committed by Yan, Zheng
iput_final() may wait for readahead pages. The wait can cause a deadlock. For example:

  Workqueue: ceph-msgr ceph_con_workfn [libceph]
  Call Trace:
   schedule+0x36/0x80
   io_schedule+0x16/0x40
   __lock_page+0x101/0x140
   truncate_inode_pages_range+0x556/0x9f0
   truncate_inode_pages_final+0x4d/0x60
   evict+0x182/0x1a0
   iput+0x1d2/0x220
   iterate_session_caps+0x82/0x230 [ceph]
   dispatch+0x678/0xa80 [ceph]
   ceph_con_workfn+0x95b/0x1560 [libceph]
   process_one_work+0x14d/0x410
   worker_thread+0x4b/0x460
   kthread+0x105/0x140
   ret_from_fork+0x22/0x40

  Workqueue: ceph-msgr ceph_con_workfn [libceph]
  Call Trace:
   __schedule+0x3d6/0x8b0
   schedule+0x36/0x80
   schedule_preempt_disabled+0xe/0x10
   mutex_lock+0x2f/0x40
   ceph_check_caps+0x505/0xa80 [ceph]
   ceph_put_wrbuffer_cap_refs+0x1e5/0x2c0 [ceph]
   writepages_finish+0x2d3/0x410 [ceph]
   __complete_request+0x26/0x60 [libceph]
   handle_reply+0x6c8/0xa10 [libceph]
   dispatch+0x29a/0xbb0 [libceph]
   ceph_con_workfn+0x95b/0x1560 [libceph]
   process_one_work+0x14d/0x410
   worker_thread+0x4b/0x460
   kthread+0x105/0x140
   ret_from_fork+0x22/0x40

In the example above, truncate_inode_pages_range() waits for readahead pages while holding s_mutex, and ceph_check_caps() waits for s_mutex while blocking the OSD dispatch thread, so later OSD replies (for the readahead) can't be handled. ceph_check_caps() may also take snap_rwsem for read, so a similar deadlock can happen if iput_final() is called while holding snap_rwsem. In general, it's not good to call iput_final() inside MDS/OSD dispatch threads or while holding any mutex. The fix is to introduce ceph_async_iput(), which calls iput_final() from a workqueue.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
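A sketch of the deferred-iput pattern, assuming a per-client workqueue (inode_wq) and a work item (i_work) embedded in the ceph inode; the surrounding code is simplified from the description above:

  /* runs in workqueue context, where a blocking iput_final() is safe */
  static void ceph_inode_work(struct work_struct *work)
  {
      struct ceph_inode_info *ci =
          container_of(work, struct ceph_inode_info, i_work);

      iput(&ci->vfs_inode);
  }

  void ceph_async_iput(struct inode *inode)
  {
      struct ceph_inode_info *ci;

      if (!inode)
          return;
      ci = ceph_inode(inode);
      for (;;) {
          /* fast path: not the last reference, drop it inline */
          if (atomic_add_unless(&inode->i_count, -1, 1))
              break;
          /* last reference: punt the final iput to the workqueue */
          if (queue_work(ceph_inode_to_client(inode)->inode_wq,
                         &ci->i_work))
              break;
          /* work was already queued and pins a reference; retry */
      }
  }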
-
- 08 May 2019, 11 commits
-
Committed by Jeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Jeff Layton
I originally thought there was a potential race here, but the fact that this is called with the mdsc->mutex held ensures that the last reference to the session can't be put here. Still, it's clearer to just return the value from get_session here, and it may prevent a bug later if we ever rework this code to be less reliant on mutexes.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Jeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Jeff Layton
Nothing calls ceph_mdsc_submit_request today, but in later patches we'll need to be able to call this separately. Have the helper return an int so we can check r_err under the mutex, and have the caller just check the error code from the submit. Also move the acquisition of CEPH_CAP_PIN references into the same function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
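Under that split, do_request reduces to submit-then-wait. A hedged sketch, with ceph_mdsc_wait_request() assumed as the companion helper and signatures simplified:

  int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
                           struct ceph_mds_request *req)
  {
      /* takes CEPH_CAP_PIN refs and returns req->r_err under the mutex */
      int err = ceph_mdsc_submit_request(mdsc, req);

      if (!err)
          err = ceph_mdsc_wait_request(mdsc, req);
      return err;
  }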
-
Committed by Jeff Layton
No MDS requests use r_callback today, but that will change in the future. The OSD client always does r_callback and then completes r_completion. Let's have the MDS client do the same.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Jeff Layton
We make copies of the dentry name in set_request_path_attr, but then create_request_message re-fetches the lengths out of the dentry. While we don't currently set the *_drop fields unless the parents are locked, it's still better not to rely on that sort of implicit assumption. Use the pathlen values that set_request_path_attr returned instead, as they will always be correct for the returned paths themselves.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Jeff Layton
Al suggested we get rid of the kmalloc here and just use __getname and __putname to get a full PATH_MAX pathname buffer. Since we build the path in reverse, we continue to return a pointer to the beginning of the string and the length, and add a new helper to free the thing at the end.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
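A sketch of the reverse build in a __getname() buffer; the dentry walk is elided, and the freeing helper mirrors the one described above (names here are illustrative):

  /* build the path right-to-left inside a full PATH_MAX buffer */
  static char *build_path_reversed(struct dentry *dentry, int *plen)
  {
      char *path = __getname();
      int pos = PATH_MAX - 1;

      if (!path)
          return ERR_PTR(-ENOMEM);
      path[pos] = '\0';
      /* walk towards the root, prepending each component (elided):
       *   pos -= len; memcpy(path + pos, name, len); path[--pos] = '/';
       */
      *plen = PATH_MAX - 1 - pos;
      return path + pos;      /* caller sees only the built tail */
  }

  /* free from the start of the allocation, not the returned pointer */
  static inline void free_built_path(char *path, int len)
  {
      if (!IS_ERR_OR_NULL(path))
          __putname(path - (PATH_MAX - 1 - len));
  }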
-
Committed by Jeff Layton
While it may be slightly more efficient, it's probably not worthwhile to optimize for the case that clone_dentry_name handles. We can get the same result by just calling ceph_mdsc_build_path when the parent isn't locked, with less code duplication.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Jeff Layton
temp is not defined outside of the RCU critical section here. Ensure we grab that value before we drop the rcu_read_lock.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
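The rule the fix enforces, as a sketch: copy anything you need out of the RCU-protected walk before rcu_read_unlock(), because the pointers may go away the moment the critical section ends (the function name here is illustrative):

  static u64 example_walk(struct dentry *dentry, unsigned *plen)
  {
      struct dentry *temp;
      unsigned len = 0;
      u64 base;

      rcu_read_lock();
      for (temp = dentry; !IS_ROOT(temp); temp = temp->d_parent)
          len += temp->d_name.len + 1;
      base = ceph_ino(d_inode(temp)); /* copy while still protected */
      rcu_read_unlock();
      /* 'temp' must not be touched past this point */
      *plen = len;
      return base;
  }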
-
Committed by Jeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Luis Henriques
The CephFS kernel client does not enforce quotas set in a directory that isn't visible from the mount point. For example, given the path '/dir1/dir2', if quotas are set in 'dir1' and the filesystem is mounted with

  mount -t ceph <server>:<port>:/dir1/ /mnt

then the client won't be able to access the 'dir1' inode, even if 'dir2' belongs to a quota realm that points to it. This patch fixes the issue by simply doing an MDS LOOKUPINO operation for unknown inodes. Any inode reference obtained this way is added to a list in ceph_mds_client and is only released when the filesystem is unmounted.

Link: https://tracker.ceph.com/issues/38482
Reported-by: Hendrik Peyerl <hpeyerl@plusline.net>
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
- 24 April 2019, 2 commits
-
Committed by Yan, Zheng
We missed two places where i_wrbuffer_ref_head, i_wr_ref, i_dirty_caps and i_flushing_caps may change. When they are all zero, we should free i_head_snapc.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/38224
Reported-and-tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Jeff Layton
Ben reported tripping the BUG_ON in create_request_message during some performance testing. Analysis of the vmcore showed that the length of the r_dentry->d_name string changed after we allocated the buffer, but before we encoded it. build_dentry_path returns pointers to d_name in the common case of non-snapped dentries, but this optimization isn't safe unless the parent directory is locked. When it isn't, have the code make a copy of the d_name while holding the d_lock.

Cc: stable@vger.kernel.org
Reported-by: Ben England <bengland@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
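A sketch of the locked-copy approach, modeled on the clone_dentry_name() idea: allocate outside d_lock (no allocation may happen under a spinlock), copy under it, and retry if the name length changed underneath us:

  static int clone_dentry_name(struct dentry *dentry, const char **ppath,
                               int *ppathlen)
  {
      int len;
      char *name;

  retry:
      len = READ_ONCE(dentry->d_name.len);
      name = kmalloc(len + 1, GFP_NOFS);
      if (!name)
          return -ENOMEM;

      spin_lock(&dentry->d_lock);
      if (dentry->d_name.len != len) {    /* raced with a rename */
          spin_unlock(&dentry->d_lock);
          kfree(name);
          goto retry;
      }
      memcpy(name, dentry->d_name.name, len);
      spin_unlock(&dentry->d_lock);

      name[len] = '\0';
      *ppath = name;
      *ppathlen = len;
      return 0;
  }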
-
- 06 March 2019, 9 commits
-
Committed by Yan, Zheng
If the number of caps exceeds the limit, ceph_trim_dentries() also trims dentries with valid leases. Trimming a dentry releases references to the associated inode, which may evict the inode and release caps. By default, there is no limit on the caps count.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
A previous commit made the VFS delete a stale dentry when its last reference is dropped. But a lease can also become invalid while the corresponding dentry holds no reference. This patch makes cephfs periodically scan the lease lists and delete a dentry if its lease is invalid.

There are two types of lease: dentry lease and dir lease. A dentry lease has a lifetime and applies to a single dentry; it is added to the tail of a list when it's updated, so leases at the front of the list expire first. A dir lease is CEPH_CAP_FILE_SHARED on a directory inode and applies to all dentries in the directory. Dentries that have dir leases are added to another list, which is periodically checked in a round-robin manner.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
Introduce ceph_d_delete(), which checks whether a dentry has a valid lease.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
When pending cap releases fill up one message, start a work item to send the cap release message. (The old way was to send cap releases every 5 seconds.)

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
Link: http://tracker.ceph.com/issues/37576
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
In a versioned reply, inodestat, dirstat and lease are encoded with version, compat_version and struct_len. Based on a patch from Jos Collin <jcollin@redhat.com>.

Link: http://tracker.ceph.com/issues/26936
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
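A sketch of decoding such a versioned struct with the stock <linux/ceph/decode.h> helpers; field decoding is elided and the function name is illustrative:

  static int decode_versioned(void **p, void *end)
  {
      u8 struct_v, struct_compat;
      u32 struct_len;
      void *struct_end;

      ceph_decode_need(p, end, 2 * sizeof(u8) + sizeof(u32), bad);
      struct_v = ceph_decode_8(p);
      struct_compat = ceph_decode_8(p);
      struct_len = ceph_decode_32(p);
      ceph_decode_need(p, end, struct_len, bad);
      struct_end = *p + struct_len;

      /* ... decode the fields this client understands ... */

      *p = struct_end;    /* skip any fields newer than us */
      return 0;
  bad:
      return -EIO;
  }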
-
Committed by Yan, Zheng
ceph_getattr() returns a zero dev ID for head inodes and sets the dev ID directly to the snapid for snapshot inodes. This is not good because userspace utilities may consider a device ID of 0 invalid, and a snapid may conflict with another device's ID. This patch introduces a "snapid to anonymous bdev ID" map: we create a new mapping when we see a snapid for the first time, and we trim an unused mapping after it has been idle for 5 minutes.

Link: http://tracker.ceph.com/issues/22353
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
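A sketch of such a map, assuming a per-client list under a spinlock (the field names on mdsc are assumptions); the real code uses an rbtree and rechecks for a racing insert before adding:

  struct snapid_map {
      struct list_head node;
      u64 snap;
      dev_t dev;
      unsigned long last_used;    /* for the 5-minute idle trim */
  };

  static dev_t snapid_to_dev(struct ceph_mds_client *mdsc, u64 snap)
  {
      struct snapid_map *sm;

      spin_lock(&mdsc->snapid_map_lock);
      list_for_each_entry(sm, &mdsc->snapid_map_list, node) {
          if (sm->snap == snap) {
              sm->last_used = jiffies;
              spin_unlock(&mdsc->snapid_map_lock);
              return sm->dev;
          }
      }
      spin_unlock(&mdsc->snapid_map_lock);

      sm = kzalloc(sizeof(*sm), GFP_NOFS);
      if (!sm || get_anon_bdev(&sm->dev)) {   /* allocate an anonymous dev_t */
          kfree(sm);
          return 0;   /* fall back to the old zero dev ID */
      }
      sm->snap = snap;
      sm->last_used = jiffies;

      spin_lock(&mdsc->snapid_map_lock);
      list_add(&sm->node, &mdsc->snapid_map_list);
      spin_unlock(&mdsc->snapid_map_lock);
      return sm->dev;
  }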
-
Committed by Yan, Zheng
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
- 26 December 2018, 2 commits
-
Committed by Yan, Zheng
The MDS hasn't used inode paths since the introduction of inode backtraces.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Committed by Yan, Zheng
The MDS contains an optimization: it does not re-issue stale caps if the client does not want any caps. A special case of this optimization is a client that wants some caps but skipped updating 'wanted'. For this case, the client needs to update 'wanted' when a stale session gets renewed.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
- 09 November 2018, 1 commit
-
Committed by Ilya Dryomov
No one is running pre-argonaut. In addition, one of the argonaut features (NOSRCADDR) has been required since day one (and a half, 2.6.34 vs 2.6.35) of the kernel client. Allow for the possibility of reusing these feature bits later.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
-
- 22 October 2018, 1 commit
-
Committed by Ilya Dryomov
Currently message data items are allocated with ceph_msg_data_create() in setup_request_data() inside send_request(). send_request() has never been allowed to fail, so each allocation is followed by a BUG_ON:

  data = ceph_msg_data_create(...);
  BUG_ON(!data);

It's been this way since support for multiple message data items was added in commit 6644ed7b ("libceph: make message data be a pointer") in 3.10. There is no reason to delay the allocation of message data items until the last possible moment, and we certainly don't need a linked list of them, as they are only ever appended to the end and never erased. Make ceph_msg_new2() take max_data_items and adapt the rest of the code.

Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
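A sketch of the array-based bookkeeping, with the struct layout simplified (ceph_msg_sketch and msg_alloc_data are illustrative names); the point is that the allocation moves to construction time, where failure is allowed:

  struct ceph_msg_data;   /* one data item (pages, pagelist, bio, ...) */

  struct ceph_msg_sketch {
      struct ceph_msg_data *data;     /* array, not a linked list */
      int num_data_items;             /* items are only appended */
      int max_data_items;             /* fixed at ceph_msg_new2() time */
  };

  static int msg_alloc_data(struct ceph_msg_sketch *m, int max_data_items,
                            gfp_t flags)
  {
      m->data = kmalloc_array(max_data_items, sizeof(*m->data), flags);
      if (!m->data)
          return -ENOMEM; /* fail early, at a point that is allowed to fail */
      m->num_data_items = 0;
      m->max_data_items = max_data_items;
      return 0;
  }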
-