Commits · 3977058c468b872c6bc5e5273bf911d791848643 · openeuler / raspberrypi-kernel

05 Apr, 2014 23 commits

libceph: safely decode max_osd value in osdmap_decode() · 3977058c

Authored by Ilya Dryomov on Mar 13, 2014

max_osd value is not covered by any ceph_decode_need().  Use a safe
version of ceph_decode_* macro to decode it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

libceph: fixup error handling in osdmap_decode() · 597b52f6

Authored by Ilya Dryomov on Mar 13, 2014

The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro.  This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset.  Fix this by adding a special e_inval label to
be used by all ceph_decode_* macros.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
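
The single-label scheme described in this message can be sketched roughly as follows. This is a minimal userspace illustration, not the kernel code: `TOY_DECODE_32_SAFE`, `struct toy_map` and `toy_map_decode()` are hypothetical names, and the macro only stands in for the real bounds-checked ceph_decode_* macros.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for a bounds-checked ceph_decode_*_safe() macro:
 * copy a 32-bit value from *p, or jump to `bad` if fewer than 4 bytes
 * remain. Host byte order is used here for brevity.
 */
#define TOY_DECODE_32_SAFE(p, end, v, bad)                  \
    do {                                                    \
        if ((size_t)((end) - *(p)) < sizeof(uint32_t))      \
            goto bad;                                       \
        memcpy(&(v), *(p), sizeof(uint32_t));               \
        *(p) += sizeof(uint32_t);                           \
    } while (0)

struct toy_map {
    uint32_t epoch;
    uint32_t max_osd;
};

/* Returns 0 on success or -EINVAL if the buffer is too short. */
int toy_map_decode(const unsigned char *buf, size_t len, struct toy_map *map)
{
    const unsigned char *p = buf;
    const unsigned char *end = buf + len;
    int err = 0;

    TOY_DECODE_32_SAFE(&p, end, map->epoch, e_inval);
    TOY_DECODE_32_SAFE(&p, end, map->max_osd, e_inval);
    goto out;

e_inval:
    err = -EINVAL;   /* the only place where the error code is set */
out:
    return err;
}
```

Because every decode step jumps to the same label, no caller-side "reset err before decoding" step can be forgotten.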

libceph: split osdmap allocation and decode steps · a2505d63

Authored by Ilya Dryomov on Mar 13, 2014

Split osdmap allocation and initialization into a separate function,
ceph_osdmap_decode().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

libceph: dump osdmap and enhance output on decode errors · 38a8d560

Authored by Ilya Dryomov on Mar 13, 2014

Dump osdmap in hex on both full and incremental decode errors, to make
it easier to match the contents with error offset.  dout() map epoch
and max_osd value on success.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

libceph: dump pg_temp mappings to debugfs · 1c00240e

Authored by Ilya Dryomov on Mar 13, 2014

Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g.:

    pg_temp 2.6 [2,3,4]
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

libceph: do not prefix osd lines with \t in debugfs output · 0a2800d7

Authored by Ilya Dryomov on Mar 13, 2014

To save screen space in anticipation of more fields (e.g. primary
affinity).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

libceph: refer to osdmap directly in osdmap_show() · 35fea3a1

Authored by Ilya Dryomov on Mar 13, 2014

To make it more readable and save screen space.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

crush: support chooseleaf_vary_r tunable (tunables3) by default · 07bd7de4

Authored by Ilya Dryomov on Mar 19, 2014

Add TUNABLES3 feature (chooseleaf_vary_r tunable) to a set of features
supported by default.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

crush: add SET_CHOOSELEAF_VARY_R step · d83ed858

Authored by Ilya Dryomov on Mar 19, 2014

This lets you adjust the vary_r tunable on a per-rule basis.

Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

crush: add chooseleaf_vary_r tunable · e2b149cc

Authored by Ilya Dryomov on Mar 19, 2014

The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call.  That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.

Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.

Note that this was done from the get-go for the new crush_choose_indep
algorithm.

This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.

Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
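
As a rough illustration of "incorporating the parent 'r' value into the recursive starting point", the sketch below computes a starting value for the recursive selection. `toy_sub_r()` is a hypothetical helper, and the right shift by `vary_r - 1` is an assumption modeled on upstream CRUSH rather than something stated in the message above.

```c
/*
 * Starting 'r' for the recursive (leaf) selection.
 * vary_r == 0: always start from 0, so a retry repeats the same descent.
 * vary_r >= 1: derive the start from the parent's r (assumed shift;
 *              vary_r is expected to be small, well below 32).
 */
unsigned int toy_sub_r(unsigned int parent_r, unsigned int vary_r)
{
    if (vary_r == 0)
        return 0;
    return parent_r >> (vary_r - 1);
}
```

With the tunable off, every retry feeds the same starting value into the recursive call; with it on, successive parent attempts start the recursion from different values.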

crush: allow crush rules to set (re)tries counts to 0 · 6ed1002f

Authored by Ilya Dryomov on Mar 19, 2014

These two fields are misnomers; they are *retry* counts.

Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

crush: fix off-by-one errors in total_tries refactor · 48a163db

Authored by Ilya Dryomov on Mar 19, 2014

Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
code to allow adjustment of the retry counts on a per-pool basis. That
commit had an off-by-one bug: the previous "tries" counter was a *retry*
count, not a *try* count, but the new code was passing in 1 meaning
there should be no retries.

Fix the ftotal vs tries comparison to use < instead of <= to fix the
problem. Note that the original code used <= here, which means the
global "choose_total_tries" tunable is actually counting retries.
Compensate for that by adding 1 in crush_do_rule when we pull the tunable
into the local variable.

This was noticed looking at output from a user provided osdmap.
Unfortunately the map doesn't illustrate the change in mapping behavior
and I haven't managed to construct one yet that does. Inspection of the
crush debug output now aligns with prior versions, though.

Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
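
The difference between the two comparisons can be checked with a simplified model of a retry loop in which every attempt fails. `attempts_made()` is a hypothetical helper written for this illustration; it is not the CRUSH code itself.

```c
/*
 * Count how many attempts a retry loop makes when every attempt fails.
 * inclusive != 0 models the `ftotal <= tries` test,
 * inclusive == 0 models the `ftotal < tries` test.
 */
unsigned int attempts_made(unsigned int tries, int inclusive)
{
    unsigned int ftotal = 0;    /* failures so far */
    unsigned int attempts = 0;
    int retry;

    do {
        attempts++;             /* this attempt fails */
        ftotal++;
        retry = inclusive ? (ftotal <= tries) : (ftotal < tries);
    } while (retry);

    return attempts;
}
```

In this model, passing 1 with `<=` still yields two attempts, while `<` yields exactly one. A tunable value T compared with `<=` gives the same number of attempts as T + 1 compared with `<`, which matches the "adding 1" compensation described above.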

ceph: don't include ceph.{file,dir}.layout vxattr in listxattr() · cc48c3e8

Authored by Yan, Zheng on Mar 24, 2014

This avoids 'cp -a' modifying layout of new files/directories.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: check buffer size in ceph_vxattrcb_layout() · 1e5c6649

Authored by Yan, Zheng on Mar 24, 2014

If buffer size is zero, return the size of layout vxattr. If buffer
size is not zero, check if it is large enough for layout vxattr.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
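
The two cases described above follow the usual getxattr-style convention and might look like the sketch below. `toy_layout_vxattr_get()` is a hypothetical helper operating on a preformatted string; returning -ERANGE for a too-small buffer is an assumption based on that convention, not a quote from the patch.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: copy a preformatted layout string into the
 * caller's buffer. size == 0 is a pure size query; a non-zero size
 * that is too small yields -ERANGE.
 */
long toy_layout_vxattr_get(const char *layout, char *val, size_t size)
{
    size_t len = strlen(layout);

    if (size == 0)
        return (long)len;       /* caller only asks how much space is needed */
    if (size < len)
        return -ERANGE;         /* buffer cannot hold the value */
    memcpy(val, layout, len);   /* the value is copied without a trailing NUL */
    return (long)len;
}
```

A caller would typically call once with size 0 to learn the length, allocate a buffer, and call again.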

ceph: fix null pointer dereference in discard_cap_releases() · 00bd8edb

Authored by Yan, Zheng on Mar 24, 2014

send_mds_reconnect() may call discard_cap_releases() after all
release messages have been dropped by cleanup_cap_releases().
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance() · d90deda6

Authored by Yan, Zheng on Mar 23, 2014

When there is no more data, ceph_msg_data_{pages,pagelist}_advance()
should not move on to the next page.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: Remove get/set acl on symlinks · 5f75ce57

Authored by Fabian Frederick on Mar 21, 2014

Remove unsupported symlink operations.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

ceph: set mds_wanted when MDS reply changes a cap to auth cap · d9ffc4f7

Authored by Yan, Zheng on Mar 18, 2014

When adjusting the caps a client wants, the MDS does not record caps
that are not allowed. A non-auth MDS does not record WR caps. So when
an MDS reply changes a non-auth cap to an auth cap, the client needs to
set the cap's mds_wanted according to the reply.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: use fl->fl_file as owner identifier of flock and posix lock · eb13e832

Authored by Yan, Zheng on Mar 09, 2014

flock and POSIX locks should use fl->fl_file instead of the process ID
as the owner identifier. (A POSIX lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it can also be a customized
value.) The process ID of the lock holder is only used for the F_GETLK
fcntl(2).

The fix is to rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', and to rename the 'pid_namespace'
fields to 'pid'. Assign fl->fl_file to the 'owner' field of lock
messages. We also set the most significant bit of the 'owner' field.
The MDS can use that bit to distinguish between old and new clients.

The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking for
conflicting locks.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
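
The owner encoding described above could be sketched as follows, assuming a 64-bit 'owner' field. The names `TOY_LOCK_OWNER_NEW_CLIENT`, `toy_encode_lock_owner()` and `toy_owner_from_new_client()` are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

/* Hypothetical name for the most significant bit of a 64-bit owner field. */
#define TOY_LOCK_OWNER_NEW_CLIENT (1ULL << 63)

/* Build the owner value from the file pointer and mark it as "new style". */
uint64_t toy_encode_lock_owner(const void *fl_file)
{
    return (uint64_t)(uintptr_t)fl_file | TOY_LOCK_OWNER_NEW_CLIENT;
}

/* Receiver-side check: was this owner value produced by a new client? */
int toy_owner_from_new_client(uint64_t owner)
{
    return (owner & TOY_LOCK_OWNER_NEW_CLIENT) != 0;
}
```

An old client that still sends a process ID in this field would leave the top bit clear for any realistic PID, which is what lets the receiver tell the two apart.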

ceph: forbid mandatory file lock · eb70c0ce

Authored by Yan, Zheng on Mar 04, 2014

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: use fl->fl_type to decide flock operation · 0e8e95d6

Authored by Yan, Zheng on Mar 04, 2014

The VFS does not pass flock's operation code directly to the
filesystem's flock callback. It translates the operation code into the
form used for POSIX lock parameters.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: update i_max_size even if inode version does not change · 8c93cd61

Authored by Yan, Zheng on Mar 08, 2014

Handle the following sequence of events:
 - client releases an inode with i_max_size > 0. The release message
   is queued (it is not sent to the auth MDS).
 - a 'lookup' request reply from a non-auth MDS returns the same inode.
 - client opens the inode in write mode. The version of the inode trace
   in the 'open' request reply is equal to the cached inode's version.
 - client requests a new max size. The MDS ignores the request because
   it does not affect the client's write range.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ceph: make sure write caps are registered with auth MDS · a2550604

Authored by Yan, Zheng on Mar 08, 2014

Only auth MDS can issue write caps to clients, so don't consider
write caps registered with non-auth MDS as valid.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

03 Apr, 2014 17 commits

ceph: print inode number for LOOKUPINO request · c137a32a

Authored by Yan, Zheng on Mar 01, 2014

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ceph: add get_name() NFS export callback · 19913b4e

Authored by Yan, Zheng on Mar 06, 2014

Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ceph: fix ceph_fh_to_parent() · 8996f4f2

Authored by Yan, Zheng on Mar 01, 2014

ceph_fh_to_parent() returns the dentry that corresponds to the 'ino'
field of struct ceph_nfs_confh. This is wrong; it should return the
dentry that corresponds to the 'parent_ino' field.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ceph: add get_parent() NFS export callback · 9017c2ec

Authored by Yan, Zheng on Mar 01, 2014

The callback uses LOOKUPPARENT MDS request to find parent.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ceph: simplify ceph_fh_to_dentry() · 4f32b42d

Authored by Yan, Zheng on Mar 01, 2014

MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way.
So __cfh_to_dentry() is redundant.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ceph: fscache: Wait for completion of object initialization · f1fc4fee

Authored by Yunchuan Wen on Dec 26, 2013

The object store limit needs to be updated after writing, and this can
only be done once the corresponding object has been initialized. Object
initialization is currently done asynchronously, which introduces a
race: if a file is opened and then immediately written to, the
initialization may not have completed, and the code hits the ASSERT in
fscache_submit_exclusive_op(), causing a kernel bug.
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>

ceph: fscache: Update object store limit after file writing · 32d3e148

Authored by Yunchuan Wen on Dec 26, 2013

Synchronize object->store_limit[_l] with new inode->i_size after file writing.
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>

ceph: fscache: add an interface to synchronize object store limit · 020c4bdd

Authored by Yunchuan Wen on Dec 26, 2013

Add an interface to explicitly synchronize object->store_limit[_l]
with inode->i_size.
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>

ceph: do not set r_old_dentry_dir on link() · 4b58c9b1

Authored by Sage Weil on Feb 05, 2013

This is racy--we do not know whether d_parent has changed out from
underneath us because i_mutex is not held on the source inode's directory.

Also, taking this reference is useless.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: do not assume r_old_dentry[_dir] always set together · 844d87c3

Authored by Sage Weil on Feb 05, 2013

Do not assume that r_old_dentry implies that r_old_dentry_dir is also
set.  Separate out the ref cleanup and make the debug dump behave when
it is NULL.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: do not chain inode updates to parent fsync · 752c8bdc

Authored by Sage Weil on Feb 05, 2013

The fsync(dirfd) only covers namespace operations, not inode updates.
We do not need to cover setattr variants or O_TRUNC.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename() · 180061a5

Authored by Sage Weil on Feb 05, 2013

This is just old_dir; no reason to abuse the dcache pointers.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: let MDS adjust readdir 'frag' · 15289dc8

Authored by Yan, Zheng on Mar 03, 2014

If readdir 'frag' is adjusted, readdir 'offset' should be reset.
Otherwise some dentries may be lost when readdir and directory
fragmentation happen at the same time.

Another way to fix this issue is to let the MDS adjust readdir 'frag'.
The code that handles the MDS reply resets the readdir 'offset' if
the frag in the readdir reply is different from the requested one.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix reset_readdir() · dcd3cc05

Authored by Yan, Zheng on Feb 28, 2014

When changing readdir position, fi->next_offset should be set to 0
if the new position is not in the first dirfrag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>

ceph: fix ceph_dir_llseek() · f0494206

Authored by Yan, Zheng on Feb 27, 2014

Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for
a directory. For a fragmented directory, offset (frag_t, off) can be
larger than inode->i_sb->s_maxbytes.

At the very beginning of ceph_dir_llseek(), the local variable old_offset
is initialized to the parameter offset. This doesn't make sense either.
old_offset should be ceph_make_fpos(fi->frag, fi->next_offset).
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
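
To see why a directory position built from (frag, off) can exceed a byte-size limit, consider the sketch below. It assumes the frag occupies the upper 32 bits of a 64-bit position; `toy_make_fpos()` and the two accessors are hypothetical stand-ins, not the kernel's ceph_make_fpos().

```c
#include <stdint.h>

/* Pack a fragment id and an in-fragment offset into one 64-bit position. */
uint64_t toy_make_fpos(uint32_t frag, uint32_t off)
{
    return ((uint64_t)frag << 32) | off;
}

/* Recover the fragment id from a packed position. */
uint32_t toy_fpos_frag(uint64_t pos)
{
    return (uint32_t)(pos >> 32);
}

/* Recover the in-fragment offset from a packed position. */
uint32_t toy_fpos_off(uint64_t pos)
{
    return (uint32_t)(pos & 0xffffffffu);
}
```

Under this layout any non-zero frag already produces a position of at least 2^32, regardless of how small the directory is.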

rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op · 0ccd5926

Authored by Ilya Dryomov on Feb 25, 2014

In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order).  Backwards compatibility is taken care
of on the libceph/osd side.

"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once.  The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into.  To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
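
The resulting request shape, a hint op followed by the write op, can be sketched as below. The types and `toy_build_write_ops()` are hypothetical and only model the ordering and the `1 << order` computation; they are not the rbd or libceph API.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_op_type { TOY_OP_SETALLOCHINT, TOY_OP_WRITE };

struct toy_op {
    enum toy_op_type type;
    uint64_t expected_write_size;  /* used by the hint op */
    uint64_t offset;               /* used by the write op */
    uint64_t length;               /* used by the write op */
};

/*
 * Fill a two-op request: allocation hint first, then the write.
 * Returns the number of ops, which the caller has to know up front.
 * order is assumed to be below 64.
 */
size_t toy_build_write_ops(struct toy_op ops[2], unsigned int order,
                           uint64_t offset, uint64_t length)
{
    ops[0].type = TOY_OP_SETALLOCHINT;
    ops[0].expected_write_size = (uint64_t)1 << order;  /* object size */
    ops[0].offset = 0;
    ops[0].length = 0;

    ops[1].type = TOY_OP_WRITE;
    ops[1].expected_write_size = 0;
    ops[1].offset = offset;
    ops[1].length = length;

    return 2;
}
```

With order 22, for example, the hint carries an expected write size of 4194304 bytes (4 MiB).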

rbd: num_ops parameter for rbd_osd_req_create() · deb236b3

Authored by Ilya Dryomov on Feb 25, 2014

In preparation for prefixing rbd writes with an allocation hint,
introduce a num_ops parameter for rbd_osd_req_create().  The rationale
is that not every write request is a write op that needs to be prefixed
(e.g. watch op), so the num_ops logic needs to be in the callers.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
