Commits · 4d60351f9089ef0f39d73c0b6a103e61fc0ed187 · openanolis / cloud-kernel

05 Apr, 2014 20 commits

libceph: switch osdmap_set_max_osd() to krealloc() · 4d60351f

Authored by Ilya Dryomov on Mar 21, 2014

Use krealloc() instead of rolling our own.  (krealloc() with a NULL
first argument acts as a kmalloc()).  Properly initialize the new array
elements.  This is needed to make future additions to osdmap easier.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

4d60351f
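
The grow-and-initialize pattern this commit describes can be sketched in userspace C, with realloc() standing in for krealloc(). The struct toy_osdmap type, its fields and toy_set_max_osd() are hypothetical names for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the per-osd arrays in an osdmap. */
struct toy_osdmap {
	uint32_t max_osd;
	uint8_t *osd_state;	/* one entry per osd */
	uint32_t *osd_weight;	/* one entry per osd */
};

/*
 * Resize both arrays to hold 'max' entries.  realloc(NULL, n) behaves like
 * malloc(n), which mirrors the krealloc(NULL, ...) remark in the commit
 * message.  Returns 0 on success, -1 on allocation failure.
 */
static int toy_set_max_osd(struct toy_osdmap *map, uint32_t max)
{
	uint8_t *state;
	uint32_t *weight;
	uint32_t i;

	assert(max > 0);

	state = realloc(map->osd_state, max * sizeof(*state));
	if (!state)
		return -1;
	map->osd_state = state;

	weight = realloc(map->osd_weight, max * sizeof(*weight));
	if (!weight)
		return -1;
	map->osd_weight = weight;

	/* Only the newly added tail is uninitialized; give it defined values. */
	for (i = map->max_osd; i < max; i++) {
		map->osd_state[i] = 0;
		map->osd_weight[i] = 0;
	}

	map->max_osd = max;
	return 0;
}
```

Starting from a zeroed struct toy_osdmap, the first call allocates and later calls grow the arrays while preserving existing entries.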

libceph: introduce decode{,_new}_pools() and switch to them · 433fbdd3

Authored by Ilya Dryomov on Mar 21, 2014

Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc
map, same) decoding logic into a common helper and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

433fbdd3

libceph: rename __decode_pool{,_names}() to decode_pool{,_names}() · 0f70c7ee

Authored by Ilya Dryomov on Mar 21, 2014

To be in line with all the other osdmap decode helpers.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

0f70c7ee

libceph: fix and clarify ceph_decode_need() sizes · 53bbaba9

Authored by Ilya Dryomov on Mar 13, 2014

Sum up sizeof(...) results instead of (incorrectly) hard-coding the
number of bytes, expressed in ints and longs.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

53bbaba9
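
As a rough userspace illustration of the idea, the sketch below computes the required byte count as a sum of sizeof() terms before decoding a packed record. The record layout and every toy_* name are hypothetical, and fields are copied in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: a u32 epoch, a u64 id and a u8 flag, packed back to back. */
struct toy_rec {
	uint32_t epoch;
	uint64_t id;
	uint8_t flag;
};

/* Returns 1 if at least n bytes remain in [p, end), 0 otherwise. */
static int toy_has_room(const uint8_t *p, const uint8_t *end, size_t n)
{
	assert(p <= end);
	return (size_t)(end - p) >= n;
}

/* Decode one record and advance *p; returns 0 on success, -1 if too short. */
static int toy_decode_rec(const uint8_t **p, const uint8_t *end,
			  struct toy_rec *out)
{
	/*
	 * Spell the size out field by field instead of hard-coding a number.
	 * Note this is the packed size, not sizeof(struct toy_rec), which
	 * may include padding.
	 */
	const size_t need = sizeof(uint32_t) + sizeof(uint64_t) + sizeof(uint8_t);

	if (!toy_has_room(*p, end, need))
		return -1;

	memcpy(&out->epoch, *p, sizeof(out->epoch));
	*p += sizeof(out->epoch);
	memcpy(&out->id, *p, sizeof(out->id));
	*p += sizeof(out->id);
	memcpy(&out->flag, *p, sizeof(out->flag));
	*p += sizeof(out->flag);
	return 0;
}
```

Writing the size as a sum keeps the check in step with the fields that are actually decoded, which is the property the commit message is after.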

libceph: nuke bogus encoding version check in osdmap_apply_incremental() · 9464d008

Authored by Ilya Dryomov on Mar 13, 2014

Only version 6 of osdmap encoding is supported; anything other than
version 6 results in an error and halts the decoding process.  Checking
if version is >= 5 is therefore bogus.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

9464d008

libceph: fixup error handling in osdmap_apply_incremental() · 86f1742b

Authored by Ilya Dryomov on Mar 13, 2014

The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro. This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset. Follow osdmap_decode() and fix this by adding
a special e_inval label to be used by all ceph_decode_* macros.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

86f1742b

libceph: fix crush_decode() call site in osdmap_decode() · 9902e682

Authored by Ilya Dryomov on Mar 13, 2014

The size of the memory area fed to crush_decode() should be limited
not only by osdmap end, but also by the crush map length.  Also, drop
unnecessary dout() (dout() in crush_decode() conveys the same info) and
step past crush map only if it is decoded successfully.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

9902e682
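
A minimal sketch of the bounding rule, assuming a hypothetical nested decoder toy_decode_blob() in place of crush_decode(): the nested region ends at whichever comes first, the declared length or the end of the outer buffer, and the cursor moves only after the nested decode succeeds. Unlike the kernel code, this toy steps to the clamped end rather than by the raw length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical nested decoder: sums the bytes of [start, end).
 * An empty region is treated as a decode error.
 */
static int toy_decode_blob(const uint8_t *start, const uint8_t *end,
			   unsigned int *sum)
{
	if (start >= end)
		return -1;
	*sum = 0;
	while (start < end)
		*sum += *start++;
	return 0;
}

/* Decode a nested blob of declared length 'len' located at *p. */
static int toy_decode_nested(const uint8_t **p, const uint8_t *end,
			     size_t len, unsigned int *sum)
{
	const uint8_t *sub_end;

	assert(*p <= end);

	/* min(*p + len, end), computed without forming an out-of-range pointer. */
	sub_end = len < (size_t)(end - *p) ? *p + len : end;

	if (toy_decode_blob(*p, sub_end, sum))
		return -1;	/* *p is left untouched on failure */

	*p = sub_end;		/* step past the blob only after success */
	return 0;
}
```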

libceph: check length of osdmap osd arrays · 2d88b2e0

Authored by Ilya Dryomov on Mar 13, 2014

Check length of osd_state, osd_weight and osd_addr arrays.  They
should all have exactly max_osd elements after the call to
osdmap_set_max_osd().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

2d88b2e0

libceph: safely decode max_osd value in osdmap_decode() · 3977058c

Authored by Ilya Dryomov on Mar 13, 2014

max_osd value is not covered by any ceph_decode_need().  Use a safe
version of ceph_decode_* macro to decode it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

3977058c

libceph: fixup error handling in osdmap_decode() · 597b52f6

Authored by Ilya Dryomov on Mar 13, 2014

The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro.  This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset.  Fix this by adding a special e_inval label to
be used by all ceph_decode_* macros.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

597b52f6
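
The dedicated-label scheme can be sketched in userspace C as follows. TOY_DECODE_32_SAFE() and toy_decode() are hypothetical stand-ins for the ceph_decode_* macros and osdmap_decode(); the point is that every bounds failure jumps to e_inval, which sets the error code in one place, so no call site has to reset err beforehand.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy "safe decode": jump to label 'bad' if fewer than 4 bytes remain. */
#define TOY_DECODE_32_SAFE(p, end, v, bad)				\
	do {								\
		if ((size_t)((end) - *(p)) < sizeof(uint32_t))		\
			goto bad;					\
		memcpy(&(v), *(p), sizeof(uint32_t));			\
		*(p) += sizeof(uint32_t);				\
	} while (0)

/* Decode: u32 epoch, u32 count n, then n u32 values (host byte order). */
static int toy_decode(const uint8_t **p, const uint8_t *end,
		      uint32_t *epoch, uint32_t *n, uint32_t **vals)
{
	uint32_t i;
	int err;

	assert(*p <= end);
	*vals = NULL;

	TOY_DECODE_32_SAFE(p, end, *epoch, e_inval);
	TOY_DECODE_32_SAFE(p, end, *n, e_inval);

	/* Reject counts that cannot fit in the remaining bytes. */
	if (*n > (size_t)(end - *p) / sizeof(uint32_t))
		goto e_inval;

	*vals = calloc(*n ? *n : 1, sizeof(**vals));
	if (!*vals) {
		err = -ENOMEM;
		goto bad;
	}

	for (i = 0; i < *n; i++)
		TOY_DECODE_32_SAFE(p, end, (*vals)[i], e_inval);

	return 0;

e_inval:
	err = -EINVAL;	/* single place where decode failures get their code */
bad:
	free(*vals);
	*vals = NULL;
	return err;
}
```

Non-decode failures such as allocation errors set err themselves and jump straight to bad, so the two kinds of failure cannot be confused.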

libceph: split osdmap allocation and decode steps · a2505d63

Authored by Ilya Dryomov on Mar 13, 2014

Split osdmap allocation and initialization into a separate function,
ceph_osdmap_decode().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

a2505d63

libceph: dump osdmap and enhance output on decode errors · 38a8d560

Authored by Ilya Dryomov on Mar 13, 2014

Dump osdmap in hex on both full and incremental decode errors, to make
it easier to match the contents with error offset.  dout() map epoch
and max_osd value on success.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

38a8d560

libceph: dump pg_temp mappings to debugfs · 1c00240e

Authored by Ilya Dryomov on Mar 13, 2014

Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g.:

    pg_temp 2.6 [2,3,4]
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

1c00240e
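
For illustration only, a userspace helper that produces a line in the format shown in this commit message might look like the sketch below. toy_format_pg_temp() is a hypothetical name, and rendering the pgid as a decimal pool followed by a hexadecimal seed is an assumption; the commit message only shows the example "2.6".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "pg_temp <pool>.<seed> [<osd>,...,<osd>]" into buf.
 * Returns 0 on success, -1 if the output would not fit.
 */
static int toy_format_pg_temp(char *buf, size_t size, unsigned long long pool,
			      unsigned int seed, const int *osds, int n)
{
	size_t off;
	int i, ret;

	assert(buf && size > 0);

	ret = snprintf(buf, size, "pg_temp %llu.%x [", pool, seed);
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	off = (size_t)ret;

	for (i = 0; i < n; i++) {
		/* Comma before every element except the first. */
		ret = snprintf(buf + off, size - off, "%s%d",
			       i ? "," : "", osds[i]);
		if (ret < 0 || (size_t)ret >= size - off)
			return -1;
		off += (size_t)ret;
	}

	ret = snprintf(buf + off, size - off, "]");
	if (ret < 0 || (size_t)ret >= size - off)
		return -1;
	return 0;
}
```

Called with pool 2, seed 6 and osds {2, 3, 4}, it yields the same text as the example line in the commit message.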

libceph: do not prefix osd lines with \t in debugfs output · 0a2800d7

Authored by Ilya Dryomov on Mar 13, 2014

To save screen space in anticipation of more fields (e.g. primary
affinity).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

0a2800d7

libceph: refer to osdmap directly in osdmap_show() · 35fea3a1

Authored by Ilya Dryomov on Mar 13, 2014

To make it more readable and save screen space.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

35fea3a1

crush: add SET_CHOOSELEAF_VARY_R step · d83ed858

Authored by Ilya Dryomov on Mar 19, 2014

This lets you adjust the vary_r tunable on a per-rule basis.

Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

d83ed858

crush: add chooseleaf_vary_r tunable · e2b149cc

Authored by Ilya Dryomov on Mar 19, 2014

The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call.  That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.

Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.

Note that this was done from the get-go for the new crush_choose_indep
algorithm.

This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.

Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

e2b149cc
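
The effect of the tunable can be shown with a toy model. Everything here is hypothetical: toy_pick_leaf() is a trivial modular selection, not the CRUSH hash, and deriving the recursive starting point as r >> (vary_r - 1) is an assumption about how the tunable scales r, not something stated in the commit message.

```c
#include <assert.h>

/*
 * Starting 'r' for the recursive (leaf) selection.
 * vary_r == 0 models the old behaviour: the recursion always starts at 0,
 * so a retry at the parent level repeats the same leaf choice.
 */
static unsigned int toy_sub_r(unsigned int r, unsigned int vary_r)
{
	assert(vary_r < 32);
	return vary_r ? r >> (vary_r - 1) : 0;
}

/* Toy leaf selection inside one bucket.  This is NOT the CRUSH hash. */
static unsigned int toy_pick_leaf(unsigned int x, unsigned int sub_r,
				  unsigned int nleaves)
{
	assert(nleaves > 0);
	return (x + sub_r) % nleaves;
}
```

With vary_r set to 0, two attempts with parent r values 0 and 1 pick the same leaf in this model; with vary_r set to 1 they pick different leaves, which is the behaviour change the commit describes.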

crush: allow crush rules to set (re)tries counts to 0 · 6ed1002f

Authored by Ilya Dryomov on Mar 19, 2014

These two fields are misnomers; they are *retry* counts.

Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

6ed1002f

crush: fix off-by-one errors in total_tries refactor · 48a163db

Authored by Ilya Dryomov on Mar 19, 2014

Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
code to allow adjustment of the retry counts on a per-pool basis. That
commit had an off-by-one bug: the previous "tries" counter was a *retry*
count, not a *try* count, but the new code was passing in 1 meaning
there should be no retries.

Fix the ftotal vs tries comparison to use < instead of <= to fix the
problem. Note that the original code used <= here, which means the
global "choose_total_tries" tunable is actually counting retries.
Compensate for that by adding 1 in crush_do_rule when we pull the tunable
into the local variable.

This was noticed looking at output from a user provided osdmap.
Unfortunately the map doesn't illustrate the change in mapping behavior
and I haven't managed to construct one yet that does. Inspection of the
crush debug output now aligns with prior versions, though.

Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

48a163db
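
The try versus retry distinction is easy to check with a small counting loop. The sketch below is a toy model, not the CRUSH code: toy_attempts() is a hypothetical function in which every attempt fails, and it reports how many attempts are made under the <= and < comparisons.

```c
#include <assert.h>
#include <limits.h>

/*
 * Count the attempts made when every attempt fails.
 * inclusive != 0 models the "ftotal <= tries" comparison,
 * inclusive == 0 models the "ftotal < tries" comparison.
 */
static unsigned int toy_attempts(unsigned int tries, int inclusive)
{
	unsigned int ftotal = 0;	/* failures so far */
	unsigned int attempts = 0;
	int retry;

	/* With "<=" and tries == UINT_MAX the loop would never end. */
	assert(tries < UINT_MAX);

	do {
		attempts++;	/* the attempt always fails in this toy */
		ftotal++;
		retry = inclusive ? (ftotal <= tries) : (ftotal < tries);
	} while (retry);

	return attempts;
}
```

Under <=, a value of 1 allows two attempts (one retry); under <, it allows exactly one attempt. Passing tunable + 1 to the < variant gives the same number of attempts as passing the tunable to the <= variant, which is the compensation the commit message describes.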

libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance() · d90deda6

Authored by Yan, Zheng on Mar 23, 2014

When there is no more data, ceph_msg_data_{pages,pagelist}_advance()
should not move on to the next page.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

d90deda6

03 Apr, 2014 3 commits

libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op · c647b8a8

Authored by Ilya Dryomov on Feb 25, 2014

This is primarily for rbd's benefit and is supposed to combat
fragmentation:

"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks.  We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."

SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

c647b8a8

libceph: encode CEPH_OSD_OP_FLAG_* op flags · 7b25bf5f

Authored by Ilya Dryomov on Feb 25, 2014

Encode ceph_osd_op::flags field so that it gets sent over the wire.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

7b25bf5f

libceph: a per-osdc crush scratch buffer · 9d521470

Authored by Ilya Dryomov on Jan 31, 2014

With the addition of erasure coding support in the future, scratch
variable-length array in crush_do_rule_ary() is going to grow to at
least 200 bytes on average, on top of another 128 bytes consumed by
rawosd/osd arrays in the call chain. Replace it with a buffer inside
struct osdmap and a mutex. This shouldn't result in any contention,
because all osd requests were already serialized by request_mutex at
that point; the only unlocked caller was ceph_ioctl_get_dataloc().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

9d521470

08 Feb, 2014 3 commits

libceph: do not dereference a NULL bio pointer · 0ec1d15e

Authored by Ilya Dryomov on Feb 05, 2014

Commit f38a5181 ("ceph: Convert to immutable biovecs") introduced
a NULL pointer dereference, which broke rbd in -rc1.  Fix it.

Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

0ec1d15e

libceph: take map_sem for read in handle_reply() · ff513ace

Authored by Ilya Dryomov on Feb 03, 2014

Handling redirect replies requires both map_sem and request_mutex.
Taking map_sem unconditionally near the top of handle_reply() avoids
possible race conditions that arise from releasing request_mutex to be
able to acquire map_sem in redirect reply case.  (Lock ordering is:
map_sem, request_mutex, crush_mutex.)
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ff513ace

libceph: factor out logic from ceph_osdc_start_request() · 0bbfdfe8

Authored by Ilya Dryomov on Jan 31, 2014

Factor out logic from ceph_osdc_start_request() into a new helper,
__ceph_osdc_start_request().  ceph_osdc_start_request() now amounts to
taking locks and calling __ceph_osdc_start_request().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

0bbfdfe8

04 Feb, 2014 1 commit

libceph: fix error handling in ceph_osdc_init() · c172ec5c

Authored by Ilya Dryomov on Jan 31, 2014

msgpool_op_reply message pool isn't destroyed if workqueue construction
fails.  Fix it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

c172ec5c
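
The class of bug fixed here is a missing step in an unwind chain. A minimal userspace sketch of goto-based unwinding, in which each label releases exactly what was set up before the failing step, could look like this. struct toy_osdc, its fields and the fail_wq switch are hypothetical; malloc() stands in for pool and workqueue construction.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the resources set up during client init. */
struct toy_osdc {
	void *op_pool;		/* stands in for the request message pool */
	void *op_reply_pool;	/* stands in for msgpool_op_reply */
	void *wq;		/* stands in for the workqueue */
};

/* fail_wq != 0 simulates workqueue construction failing. */
static int toy_osdc_init(struct toy_osdc *c, int fail_wq)
{
	int err = -ENOMEM;

	assert(c != NULL);
	c->op_pool = c->op_reply_pool = c->wq = NULL;

	c->op_pool = malloc(16);
	if (!c->op_pool)
		goto out;

	c->op_reply_pool = malloc(16);
	if (!c->op_reply_pool)
		goto out_pool;

	c->wq = fail_wq ? NULL : malloc(16);
	if (!c->wq)
		goto out_reply_pool;	/* must also undo the reply pool */

	return 0;

out_reply_pool:
	free(c->op_reply_pool);
	c->op_reply_pool = NULL;
out_pool:
	free(c->op_pool);
	c->op_pool = NULL;
out:
	return err;
}

/* Release everything a successful toy_osdc_init() set up. */
static void toy_osdc_stop(struct toy_osdc *c)
{
	free(c->wq);
	free(c->op_reply_pool);
	free(c->op_pool);
	c->wq = c->op_reply_pool = c->op_pool = NULL;
}
```

Ordering the labels in reverse order of construction makes it harder to skip a cleanup step when a new resource is added.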

28 Jan, 2014 8 commits

libceph: follow redirect replies from osds · 205ee118

Authored by Ilya Dryomov on Jan 27, 2014

Follow redirect replies from osds, for details see ceph.git commit
fbbe3ad1220799b7bb00ea30fce581c5eadaf034.

v1 (current) version of redirect reply consists of oloc and oid, which
expands to pool, key, nspace, hash and oid.  However, server-side code
that would populate anything other than pool doesn't exist yet, and
hence this commit adds support for pool redirects only.  To make sure
that future server-side updates don't break us, we decode all fields
and, if any of key, nspace, hash or oid have a non-default value, error
out with "corrupt osd_op_reply ..." message.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

205ee118
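
The "decode everything, accept only pool" rule can be sketched as a validation step. The struct below is a hypothetical, already-decoded view of a redirect; treating -1 as the default hash and empty strings as the default key, nspace and oid is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_UNSET (-1)	/* assumed default value for the hash field */

/* Hypothetical decoded redirect: only lengths are kept for string fields. */
struct toy_redirect {
	int64_t pool;
	size_t key_len;
	size_t nspace_len;
	int64_t hash;
	size_t oid_len;
};

/*
 * Accept pool-only redirects.  Any other field carrying a non-default
 * value is rejected, so a future server-side change is detected instead
 * of being silently misinterpreted.
 */
static int toy_apply_redirect(const struct toy_redirect *r, int64_t *target_pool)
{
	assert(r && target_pool);

	if (r->key_len || r->nspace_len || r->oid_len ||
	    r->hash != TOY_HASH_UNSET)
		return -EINVAL;

	*target_pool = r->pool;
	return 0;
}
```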

libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} · 3c972c95

Authored by Ilya Dryomov on Jan 27, 2014

Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

3c972c95

libceph: follow {read,write}_tier fields on osd request submission · 17a13e40

Authored by Ilya Dryomov on Jan 27, 2014

Overwrite ceph_osd_request::r_oloc.pool with read_tier for read ops and
write_tier for write and read+write ops (aka basic tiering support).
{read,write}_tier are part of pg_pool_t since v9.  This commit bumps
our pg_pool_t decode compat version from v7 to v9; all new fields
except for {read,write}_tier are ignored.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

17a13e40

libceph: add ceph_pg_pool_by_id() · ce7f6a27

Authored by Ilya Dryomov on Jan 27, 2014

"Lookup pool info by ID" function is hidden in osdmap.c.  Expose it to
the rest of libceph.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ce7f6a27

libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg() · 7c13cb64

Authored by Ilya Dryomov on Jan 27, 2014

Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename
it to ceph_oloc_oid_to_pg() to make its purpose more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

7c13cb64

libceph: introduce and start using oid abstraction · 4295f221

Authored by Ilya Dryomov on Jan 27, 2014

In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

4295f221
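
A length-carrying object id of the kind described here might be sketched as below. struct toy_object_id, the helper names and the 100-byte limit are hypothetical; the point is that copies and comparisons use the stored length rather than relying on a terminating NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_OID_NAME_LEN 100	/* hypothetical limit */

/* Object name plus explicit length; the name may contain NUL bytes. */
struct toy_object_id {
	char name[TOY_MAX_OID_NAME_LEN];
	size_t name_len;
};

/* Store 'len' bytes of 'name'; no terminator is added or expected. */
static void toy_oid_set_name(struct toy_object_id *oid, const void *name,
			     size_t len)
{
	assert(len <= sizeof(oid->name));
	memcpy(oid->name, name, len);
	oid->name_len = len;
}

/* Copy src into dest, e.g. from a base oid to a target oid. */
static void toy_oid_copy(struct toy_object_id *dest,
			 const struct toy_object_id *src)
{
	toy_oid_set_name(dest, src->name, src->name_len);
}

/* Compare by length and bytes, not with strcmp(). */
static int toy_oid_equal(const struct toy_object_id *a,
			 const struct toy_object_id *b)
{
	return a->name_len == b->name_len &&
	       memcmp(a->name, b->name, a->name_len) == 0;
}
```

Keeping the length next to the bytes means the names "ab" and "ab\0cd" stay distinct, which string functions would not guarantee.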

libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN · 2d0ebc5d

Authored by Ilya Dryomov on Jan 27, 2014

In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

2d0ebc5d

libceph: start using oloc abstraction · 22116525

Authored by Ilya Dryomov on Jan 27, 2014

Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for encoding), start using ceph_object_locator (oloc)
abstraction.  Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool.  This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.

This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

22116525

26 Jan, 2014 2 commits

libceph: dout() is missing a newline · 0b4af2e8

Authored by Ilya Dryomov on Jan 16, 2014

Add a missing newline to a dout() in __reset_osd().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

0b4af2e8

libceph: add ceph_kv{malloc,free}() and switch to them · eeb0bed5

Authored by Ilya Dryomov on Jan 09, 2014

Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.

ceph_kvmalloc() kmalloc()'s a maximum of 8 pages; anything bigger is
vmalloc()'ed with __GFP_HIGHMEM set.  This changes the existing
behaviour:

- for buffers (ceph_buffer_new()), from trying to kmalloc() everything
  and using vmalloc() just as a fallback

- for messages (ceph_msg_new()), from going to vmalloc() for anything
  bigger than a page

- for messages (ceph_msg_new()), from disallowing vmalloc() to use high
  memory
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

eeb0bed5
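
The size threshold described in this commit message can be sketched as a decision function. This only models which allocator would be tried for a given size; it allocates nothing. The 4096-byte page size and the toy_* names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE		4096u	/* assumed page size */
#define TOY_MAX_KMALLOC_PAGES	8u	/* "a maximum of 8 pages" */

enum toy_allocator { TOY_USE_KMALLOC, TOY_USE_VMALLOC };

/*
 * Requests of up to 8 pages are served by the kmalloc-like allocator;
 * anything bigger goes to the vmalloc-like allocator.
 */
static enum toy_allocator toy_kvmalloc_choice(size_t size)
{
	assert(size > 0);
	return size <= (size_t)TOY_PAGE_SIZE * TOY_MAX_KMALLOC_PAGES
		? TOY_USE_KMALLOC : TOY_USE_VMALLOC;
}
```

Under these assumptions the cut-over happens at 32768 bytes: a request of exactly that size still counts as a kmalloc-sized request, and one byte more does not.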

14 Jan, 2014 3 commits

libceph: fix preallocation check in get_reply() · f2be82b0

Authored by Ilya Dryomov on Jan 09, 2014

The check that makes sure that we have enough memory allocated to read
in the entire header of the message in question is currently busted.
It compares front_len of the incoming message with iov_len field of
ceph_msg::front structure, which is used primarily to indicate the
amount of data already read in, and not the size of the allocated
buffer.  Under certain conditions (e.g. a short read from a socket
followed by that socket's shutdown and owning ceph_connection reset)
this results in a warning similar to

[85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0)

and, through another bug, leads to forever hung tasks and forced
reboots.  Fix this by comparing front_len with front_alloc_len field of
struct ceph_msg, which stores the actual size of the buffer.

Fixes: http://tracker.ceph.com/issues/5425
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

f2be82b0
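
The difference between the two comparisons can be reproduced with a toy message structure. struct toy_kvec and struct toy_msg are simplified, hypothetical stand-ins; only the field names iov_len and front_alloc_len come from the commit message. The numbers 198 and 122 are taken from the quoted warning, and the 512-byte buffer size is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures named in the commit. */
struct toy_kvec {
	void *iov_base;
	size_t iov_len;		/* bytes read in so far, not the capacity */
};

struct toy_msg {
	struct toy_kvec front;
	size_t front_alloc_len;	/* actual size of the allocated buffer */
};

/* Old check: compares against the amount of data already read in. */
static int toy_front_fits_old(const struct toy_msg *m, size_t front_len)
{
	assert(m != NULL);
	return front_len <= m->front.iov_len;
}

/* Fixed check: compares against the allocated buffer size. */
static int toy_front_fits(const struct toy_msg *m, size_t front_len)
{
	assert(m != NULL);
	return front_len <= m->front_alloc_len;
}
```

For a message whose buffer holds 512 bytes but whose iov_len is 122 after a short read, an incoming front_len of 198 is wrongly rejected by the old check and accepted by the fixed one.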

libceph: rename front to front_len in get_reply() · 3f0a4ac5

Authored by Ilya Dryomov on Jan 09, 2014

Rename front local variable to front_len in get_reply() to make its
purpose more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

3f0a4ac5

libceph: rename ceph_msg::front_max to front_alloc_len · 3cea4c30

Authored by Ilya Dryomov on Jan 09, 2014

Rename front_max field of struct ceph_msg to front_alloc_len to make
its purpose more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

3cea4c30
