提交 · 2c66de560fa2dda0a600e908897116914db8f500 · openeuler / Kernel

08 7月, 2019 5 次提交

ceph: remove request from waiting list before unregister · 428138c9

由 Yan, Zheng 提交于 6月 14, 2019

Link: https://tracker.ceph.com/issues/40339Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

428138c9

ceph: don't blindly unregister session that is in opening state · 6f0f597b

由 Yan, Zheng 提交于 6月 10, 2019

handle_cap_export() may add placeholder caps to session that is in
opening state. These caps' session pointer become wild after session get
unregistered.

The fix is not to unregister session in opening state during mds failovers,
just let client to reconnect later when mds is recovered.

Link: https://tracker.ceph.com/issues/40190Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

6f0f597b

ceph: ensure d_name/d_parent stability in ceph_mdsc_lease_send_msg() · 8f2a98ef

由 Yan, Zheng 提交于 5月 23, 2019

Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

8f2a98ef

ceph: use READ_ONCE to access d_parent in RCU critical section · 41883ba8

由 Yan, Zheng 提交于 5月 23, 2019

Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

41883ba8

ceph: carry snapshot creation time with inodes · 193e7b37

由 David Disseldorp 提交于 4月 18, 2019

MDS InodeStat v3 wire structures include a trailing snapshot creation
time member. Unmarshall this and retain it for a future vxattr.
Signed-off-by: NDavid Disseldorp <ddiss@suse.de>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

193e7b37

28 6月, 2019 1 次提交

ceph: fix ceph_mdsc_build_path to not stop on first component · d6b8bd67

由 Jeff Layton 提交于 5月 09, 2019

When ceph_mdsc_build_path is handed a positive dentry, it will return a
zero-length path string with the base set to that dentry.  This is not
what we want.  Always include at least one path component in the string.

ceph_mdsc_build_path has behaved this way for a long time but it didn't
matter until recent d_name handling rework.

Fixes: 964fff74 ("ceph: use ceph_mdsc_build_path instead of clone_dentry_name")
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

d6b8bd67

06 6月, 2019 1 次提交

ceph: avoid iput_final() while holding mutex or in dispatch thread · 3e1d0452

由 Yan, Zheng 提交于 5月 18, 2019

iput_final() may wait for reahahead pages. The wait can cause deadlock.
For example:

  Workqueue: ceph-msgr ceph_con_workfn [libceph]
    Call Trace:
     schedule+0x36/0x80
     io_schedule+0x16/0x40
     __lock_page+0x101/0x140
     truncate_inode_pages_range+0x556/0x9f0
     truncate_inode_pages_final+0x4d/0x60
     evict+0x182/0x1a0
     iput+0x1d2/0x220
     iterate_session_caps+0x82/0x230 [ceph]
     dispatch+0x678/0xa80 [ceph]
     ceph_con_workfn+0x95b/0x1560 [libceph]
     process_one_work+0x14d/0x410
     worker_thread+0x4b/0x460
     kthread+0x105/0x140
     ret_from_fork+0x22/0x40

  Workqueue: ceph-msgr ceph_con_workfn [libceph]
    Call Trace:
     __schedule+0x3d6/0x8b0
     schedule+0x36/0x80
     schedule_preempt_disabled+0xe/0x10
     mutex_lock+0x2f/0x40
     ceph_check_caps+0x505/0xa80 [ceph]
     ceph_put_wrbuffer_cap_refs+0x1e5/0x2c0 [ceph]
     writepages_finish+0x2d3/0x410 [ceph]
     __complete_request+0x26/0x60 [libceph]
     handle_reply+0x6c8/0xa10 [libceph]
     dispatch+0x29a/0xbb0 [libceph]
     ceph_con_workfn+0x95b/0x1560 [libceph]
     process_one_work+0x14d/0x410
     worker_thread+0x4b/0x460
     kthread+0x105/0x140
     ret_from_fork+0x22/0x40

In above example, truncate_inode_pages_range() waits for readahead pages
while holding s_mutex. ceph_check_caps() waits for s_mutex and blocks
OSD dispatch thread. Later OSD replies (for readahead) can't be handled.

ceph_check_caps() also may lock snap_rwsem for read. So similar deadlock
can happen if iput_final() is called while holding snap_rwsem.

In general, it's not good to call iput_final() inside MDS/OSD dispatch
threads or while holding any mutex.

The fix is introducing ceph_async_iput(), which calls iput_final() in
workqueue.
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

3e1d0452

08 5月, 2019 11 次提交

ceph: fix unaligned access in ceph_send_cap_releases · 4198aba4

由 Jeff Layton 提交于 5月 02, 2019

Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

4198aba4

ceph: just call get_session in __ceph_lookup_mds_session · 488f5284

由 Jeff Layton 提交于 4月 24, 2019

I originally thought there was a potential race here, but the fact
that this is called with the mdsc->mutex held, ensures that the
last reference to the session can't be put here.

Still, it's clearer to just return the value from get_session here,
and may prevent a bug later if we ever rework this code to be less
reliant on mutexes.
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

488f5284

ceph: move wait for mds request into helper function · 8340f22c

由 Jeff Layton 提交于 4月 02, 2019

Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

8340f22c

ceph: have ceph_mdsc_do_request call ceph_mdsc_submit_request · 86bda539

由 Jeff Layton 提交于 4月 02, 2019

Nothing calls ceph_mdsc_submit_request today, but in later patches we'll
need to be able to call this separately.

Have the helper return an int so we can check the r_err under the mutex,
and have the caller just check the error code from the submit. Also move
the acquisition of CEPH_CAP_PIN references into the same function.
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

86bda539

ceph: after an MDS request, do callback and completions · 111c7081

由 Jeff Layton 提交于 4月 02, 2019

No MDS requests use r_callback today, but that will change in the
future. The OSD client always does r_callback and then completes
r_completion. Let's have the MDS client do the same.
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

111c7081

ceph: use pathlen values returned by set_request_path_attr · c1dfc277

由 Jeff Layton 提交于 4月 17, 2019

We make copies of the dentry name in set_request_path_attr, but then
create_request_message re-fetches the lengths out of the dentry. While
we don't currently set the *_drop fields unless the parents are locked,
it's still better not to rely on that sort of implicit assumption.

Use the pathlen values that set_request_path_attr returned instead, as
they will always be correct for the returned paths themselves.
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

c1dfc277

ceph: use __getname/__putname in ceph_mdsc_build_path · f77f21bb

由 Jeff Layton 提交于 4月 29, 2019

Al suggested we get rid of the kmalloc here and just use __getname
and __putname to get a full PATH_MAX pathname buffer.

Since we build the path in reverse, we continue to return a pointer
to the beginning of the string and the length, and add a new helper
to free the thing at the end.
Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

f77f21bb

ceph: use ceph_mdsc_build_path instead of clone_dentry_name · 964fff74

由 Jeff Layton 提交于 4月 29, 2019

While it may be slightly more efficient, it's probably not worthwhile to
optimize for the case that clone_dentry_name handles. We can get the
same result by just calling ceph_mdsc_build_path when the parent isn't
locked, with less code duplication.
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

964fff74

ceph: fix potential use-after-free in ceph_mdsc_build_path · 69a10fb3

由 Jeff Layton 提交于 4月 26, 2019

temp is not defined outside of the RCU critical section here. Ensure
we grab that value before we drop the rcu_read_lock.
Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

69a10fb3

ceph: make iterate_session_caps a public symbol · f5d77269

由 Jeff Layton 提交于 4月 24, 2019

Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

f5d77269

ceph: quota: fix quota subdir mounts · 0c44a8e0

由 Luis Henriques 提交于 3月 21, 2019

The CephFS kernel client does not enforce quotas set in a directory that
isn't visible from the mount point.  For example, given the path
'/dir1/dir2', if quotas are set in 'dir1' and the filesystem is mounted with

  mount -t ceph <server>:<port>:/dir1/ /mnt

then the client won't be able to access 'dir1' inode, even if 'dir2' belongs
to a quota realm that points to it.

This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
unknown inodes.  Any inode reference obtained this way will be added to a
list in ceph_mds_client, and will only be released when the filesystem is
umounted.

Link: https://tracker.ceph.com/issues/38482Reported-by: NHendrik Peyerl <hpeyerl@plusline.net>
Signed-off-by: NLuis Henriques <lhenriques@suse.com>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

0c44a8e0

24 4月, 2019 2 次提交

ceph: fix ci->i_head_snapc leak · 37659182

由 Yan, Zheng 提交于 4月 18, 2019

We missed two places that i_wrbuffer_ref_head, i_wr_ref, i_dirty_caps
and i_flushing_caps may change. When they are all zeros, we should free
i_head_snapc.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/38224Reported-and-tested-by: NLuis Henriques <lhenriques@suse.com>
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

37659182

ceph: only use d_name directly when parent is locked · 1bcb3440

由 Jeff Layton 提交于 4月 15, 2019

Ben reported tripping the BUG_ON in create_request_message during some
performance testing. Analysis of the vmcore showed that the length of
the r_dentry->d_name string changed after we allocated the buffer, but
before we encoded it.

build_dentry_path returns pointers to d_name in the common case of
non-snapped dentries, but this optimization isn't safe unless the parent
directory is locked. When it isn't, have the code make a copy of the
d_name while holding the d_lock.

Cc: stable@vger.kernel.org
Reported-by: NBen England <bengland@redhat.com>
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

1bcb3440

06 3月, 2019 9 次提交

ceph: add mount option to limit caps count · fe33032d

由 Yan, Zheng 提交于 2月 01, 2019

If number of caps exceed the limit, ceph_trim_dentires() also trim
dentries with valid leases. Trimming dentry releases references to
associated inode, which may evict inode and release caps.

By default, there is no limit for caps count.
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

fe33032d

ceph: periodically trim stale dentries · 37c4efc1

由 Yan, Zheng 提交于 1月 31, 2019

Previous commit make VFS delete stale dentry when last reference is
dropped. Lease also can become invalid when corresponding dentry has
no reference. This patch make cephfs periodically scan lease list,
delete corresponding dentry if lease is invalid.

There are two types of lease, dentry lease and dir lease. dentry lease
has life time and applies to singe dentry. Dentry lease is added to tail
of a list when it's updated, leases at front of the list will expire
first. Dir lease is CEPH_CAP_FILE_SHARED on directory inode, it applies
to all dentries in the directory. Dentries have dir leases are added to
another list. Dentries in the list are periodically checked in a round
robin manner.
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

37c4efc1

ceph: delete stale dentry when last reference is dropped · 1e9c2eb6

由 Yan, Zheng 提交于 1月 28, 2019

introduce ceph_d_delete(), which checks if dentry has valid lease.
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

1e9c2eb6

ceph: send cap releases more aggressively · e3ec8d68

由 Yan, Zheng 提交于 1月 14, 2019

When pending cap releases fill up one message, start a work to send
cap release message. (old way is sending cap releases every 5 seconds)
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

e3ec8d68

ceph: support getting ceph.dir.pin vxattr · 08796873

由 Yan, Zheng 提交于 1月 09, 2019

Link: http://tracker.ceph.com/issues/37576Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

08796873

ceph: support versioned reply · b37fe1f9

由 Yan, Zheng 提交于 1月 09, 2019

In versioned reply, inodestat, dirstat and lease are encoded with
version, compat_version and struct_len.

Based on a patch from Jos Collin <jcollin@redhat.com>.

Link: http://tracker.ceph.com/issues/26936Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

b37fe1f9

ceph: map snapid to anonymous bdev ID · 75c9627e

由 Yan, Zheng 提交于 12月 14, 2017

ceph_getattr() return zero dev ID for head inodes and set dev ID to
snapid directly for snaphost inodes. This is not good because userspace
utilities may consider device ID of 0 as invalid, snapid may conflict
with other device's ID.

This patch introduces "snapids to anonymous bdev IDs" map. we create a
new mapping when we see a snapid for the first time. we trim unused
mapping after it is ilde for 5 minutes.

Link: http://tracker.ceph.com/issues/22353Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Acked-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

75c9627e

Y
ceph: split large reconnect into multiple messages · 81c5a148
由 Yan, Zheng 提交于 1月 01, 2019
```
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
```
81c5a148

ceph: decode feature bits in session message · 84bf3950

由 Yan, Zheng 提交于 12月 21, 2018

Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

84bf3950

26 12月, 2018 2 次提交

ceph: don't encode inode pathes into reconnect message · 5ccedf1c

由 Yan, Zheng 提交于 12月 13, 2018

mds hasn't used inode pathes since introducing inode backtrace.
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

5ccedf1c

ceph: update wanted caps after resuming stale session · d2f8bb27

由 Yan, Zheng 提交于 12月 10, 2018

mds contains an optimization, it does not re-issue stale caps if
client does not want any cap.

A special case of the optimization is that client wants some caps,
but skipped updating 'wanted'. For this case, client needs to update
'wanted' when stale session get renewed.
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

d2f8bb27

09 11月, 2018 1 次提交

libceph: assume argonaut on the server side · 23c625ce

由 Ilya Dryomov 提交于 11月 08, 2018

No one is running pre-argonaut.  In addition one of the argonaut
features (NOSRCADDR) has been required since day one (and a half,
2.6.34 vs 2.6.35) of the kernel client.

Allow for the possibility of reusing these feature bits later.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NSage Weil <sage@redhat.com>

23c625ce

22 10月, 2018 3 次提交

libceph: preallocate message data items · 0d9c1ab3

由 Ilya Dryomov 提交于 10月 15, 2018

Currently message data items are allocated with ceph_msg_data_create()
in setup_request_data() inside send_request().  send_request() has never
been allowed to fail, so each allocation is followed by a BUG_ON:

  data = ceph_msg_data_create(...);
  BUG_ON(!data);

It's been this way since support for multiple message data items was
added in commit 6644ed7b ("libceph: make message data be a pointer")
in 3.10.

There is no reason to delay the allocation of message data items until
the last possible moment and we certainly don't need a linked list of
them as they are only ever appended to the end and never erased.  Make
ceph_msg_new2() take max_data_items and adapt the rest of the code.
Reported-by: NJerry Lee <leisurelysw24@gmail.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

0d9c1ab3

libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist() · 89486833

由 Ilya Dryomov 提交于 9月 28, 2018

Because send_mds_reconnect() wants to send a message with a pagelist
and pass the ownership to the messenger, ceph_msg_data_add_pagelist()
consumes a ref which is then put in ceph_msg_data_destroy().  This
makes managing pagelists in the OSD client (where they are wrapped in
ceph_osd_data) unnecessarily hard because the handoff only happens in
ceph_osdc_start_request() instead of when the pagelist is passed to
ceph_osd_data_pagelist_init().  I counted several memory leaks on
various error paths.

Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in
ceph_osd_data.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

89486833

libceph: introduce ceph_pagelist_alloc() · 33165d47

由 Ilya Dryomov 提交于 9月 28, 2018

struct ceph_pagelist cannot be embedded into anything else because it
has its own refcount. Merge allocation and initialization together.
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

33165d47

13 8月, 2018 3 次提交

ceph: don't drop message if it contains more data than expected · 0fcf6c02

由 Yan, Zheng 提交于 8月 03, 2018

Later version mds may encode more data into messages.
Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

0fcf6c02

ceph: support cephfs' own feature bits · 342ce182

由 Yan, Zheng 提交于 5月 11, 2018

Signed-off-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

342ce182

ceph: change to void return type for __do_request() · d5548492

由 Chengguang Xu 提交于 7月 28, 2018

We do not check return code for __do_request() in all callers,
so change to void return type.
Signed-off-by: NChengguang Xu <cgxu519@gmx.com>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

d5548492

03 8月, 2018 2 次提交

ceph: add new field max_file_size in ceph_fs_client · 719784ba

由 Chengguang Xu 提交于 7月 19, 2018

In order to not bother to VFS and other specific filesystems,
we decided to do offset validation inside ceph kernel client,
so just simply set sb->s_maxbytes to MAX_LFS_FILESIZE so that
it can successfully pass VFS check. We add new field max_file_size
in ceph_fs_client to store real file size limit and doing proper
check based on it.
Signed-off-by: NChengguang Xu <cgxu519@gmx.com>
Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

719784ba

libceph: add authorizer challenge · 6daca13d

由 Ilya Dryomov 提交于 7月 27, 2018

When a client authenticates with a service, an authorizer is sent with
a nonce to the service (ceph_x_authorize_[ab]) and the service responds
with a mutation of that nonce (ceph_x_authorize_reply). This lets the
client verify the service is who it says it is but it doesn't protect
against a replay: someone can trivially capture the exchange and reuse
the same authorizer to authenticate themselves.

Allow the service to reject an initial authorizer with a random
challenge (ceph_x_authorize_challenge). The client then has to respond
with an updated authorizer proving they are able to decrypt the
service's challenge and that the new authorizer was produced for this
specific connection instance.

The accepting side requires this challenge and response unconditionally
if the client side advertises they have CEPHX_V2 feature bit.

This addresses CVE-2018-1128.

Link: http://tracker.ceph.com/issues/24836Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NSage Weil <sage@redhat.com>

6daca13d

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功