Commits · 1b6a78b5b91cdc07cc0b940b458e90c86835cf73 · openanolis / cloud-kernel

20 Feb, 2017 1 commit

libceph: add osdmap_set_crush() helper · 1b6a78b5

Authored by Ilya Dryomov on Jan 31, 2017

Simplify osdmap_decode() and osdmap_apply_incremental() a bit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

1b6a78b5

28 Jul, 2016 2 commits

libceph: rados pool namespace support · 30c156d9

Authored by Yan, Zheng on Feb 14, 2016

Add a pool namespace pointer to struct ceph_file_layout and struct
ceph_object_locator. The pool namespace is used when mapping an object
to a PG; it's also used when composing an OSD request.

The namespace pointer in struct ceph_file_layout is RCU protected,
so libceph can read the namespace without taking a lock.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

30c156d9

libceph: define new ceph_file_layout structure · 7627151e

Authored by Yan, Zheng on Feb 03, 2016

Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

7627151e

22 Jul, 2016 1 commit

libceph: apply new_state before new_up_client on incrementals · 930c5328

Authored by Ilya Dryomov on Jul 19, 2016

Currently, osd_weight and osd_state fields are updated in the encoding
order.  This is wrong, because an incremental map may look like e.g.

    new_up_client: { osd=6, addr=... } # set osd_state and addr
    new_state: { osd=6, xorstate=EXISTS } # clear osd_state

Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state.  A non-existent OSD is considered down by the
mapping code

    for (i = 0; i < pg->pg_temp.len; i++) {
            if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
                    if (ceph_can_shift_osds(pi))
                            continue;

                    temp->osds[temp->size++] = CRUSH_ITEM_NONE;

and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:

[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680

and hung rbds on the client:

[  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[  493.566805] rbd: rbd0:   result -6 xferred 400000
[  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688

The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed

Fixes: http://tracker.ceph.com/issues/14901

Cc: stable@vger.kernel.org # 3.15+: 6dd74e44: libceph: set 'exists' flag for newly up osd
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
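
The ordering problem described in this commit can be reproduced with a few lines of C. This is a minimal sketch, not the kernel code: the flag values and the helper names (`apply_new_state`, `apply_new_up_client`, and the two wrappers) are assumptions made for illustration. It only models the osd_state flags, with new_state as an XOR mask and new_up_client as "set EXISTS and UP".

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values; only the names follow the commit message. */
#define OSD_EXISTS 0x1u
#define OSD_UP     0x2u

/* new_state carries an XOR mask that is applied to the current flags. */
static uint32_t apply_new_state(uint32_t state, uint32_t xorstate)
{
    return state ^ xorstate;
}

/* new_up_client marks the OSD as existing and up. */
static uint32_t apply_new_up_client(uint32_t state)
{
    return state | OSD_EXISTS | OSD_UP;
}

/* Encoding order (the buggy order): new_up_client, then new_state. */
static uint32_t apply_in_encoding_order(uint32_t state, uint32_t xorstate)
{
    state = apply_new_up_client(state);
    return apply_new_state(state, xorstate);
}

/* Fixed order: new_state, then new_up_client. */
static uint32_t apply_state_first(uint32_t state, uint32_t xorstate)
{
    state = apply_new_state(state, xorstate);
    return apply_new_up_client(state);
}
```

Starting from EXISTS with xorstate=EXISTS, the encoding order ends in UP without EXISTS, while the fixed order ends in EXISTS | UP.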

930c5328

31 May, 2016 1 commit

libceph: use %s instead of %pE in dout()s · 4a3262b1

Authored by Ilya Dryomov on May 30, 2016

Commit d30291b9 ("libceph: variable-sized ceph_object_id") changed
dout()s in what is now encode_request() and ceph_object_locator_to_pg()
to use %pE, mostly to document that, although all rbd and cephfs object
names are NULL-terminated strings, ceph_object_id will handle any RADOS
object name, including the one containing NULs, just fine.

However, it turns out that vbin_printf() can't handle anything but ints
and %s - all %p suffixes are ignored. The buffer %p** points to isn't
recorded, resulting in trash in the messages if the buffer had been
reused by the time bstr_printf() got to it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

4a3262b1

26 May, 2016 9 commits

libceph: allocate dummy osdmap in ceph_osdc_init() · e5253a7b

Authored by Ilya Dryomov on Apr 28, 2016

This leads to simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

e5253a7b

libceph: introduce ceph_osd_request_target, calc_target() · 63244fa1

Authored by Ilya Dryomov on Apr 28, 2016

Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request, and calc_target() for calculating mappings
and populating it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

63244fa1

libceph: pi->min_size, pi->last_force_request_resend · 04812acf

Authored by Ilya Dryomov on Apr 28, 2016

Add and decode pi->min_size and pi->last_force_request_resend.  These
are going to be used by calc_target().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

04812acf

libceph: make pgid_cmp() global · f984cb76

Authored by Ilya Dryomov on Apr 28, 2016

calc_target() code is going to need to know how to compare PGs.  Take
lhs and rhs pgid by const * while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
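
As a rough illustration of what comparing PGs by const pointer can look like, the sketch below orders two PG ids by pool and then by seed. The struct and its field names (`pg_id`, `pool`, `seed`) and the function name `pg_id_cmp` are assumptions made for this example; they are not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical PG id: a pool number plus a placement seed. */
struct pg_id {
    uint64_t pool;
    uint32_t seed;
};

/* Three-way comparison: negative, zero, or positive, like strcmp(). */
static int pg_id_cmp(const struct pg_id *lhs, const struct pg_id *rhs)
{
    assert(lhs && rhs);

    if (lhs->pool < rhs->pool)
        return -1;
    if (lhs->pool > rhs->pool)
        return 1;
    if (lhs->seed < rhs->seed)
        return -1;
    if (lhs->seed > rhs->seed)
        return 1;
    return 0;
}
```

Explicit comparisons are used instead of subtraction so that wide unsigned fields cannot overflow the int result.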

f984cb76

libceph: rename ceph_calc_pg_primary() · f81f1633

Authored by Ilya Dryomov on Apr 28, 2016

Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns the acting primary.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

f81f1633

libceph: ceph_osds, ceph_pg_to_up_acting_osds() · 6f3bfd45

Authored by Ilya Dryomov on Apr 28, 2016

Knowing just the acting set isn't enough; we need to be able to record
the up set as well to detect interval changes.  This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around.  Introduce and switch to ceph_osds to help with that.

Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

6f3bfd45

libceph: rename ceph_oloc_oid_to_pg() · d9591f5e

Authored by Ilya Dryomov on Apr 28, 2016

Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
that the returned PG is raw, and return -ENOENT instead of -EIO if the
pool doesn't exist.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

d9591f5e

libceph: nuke unused fields and functions · 0c0a8de1

Authored by Ilya Dryomov on Apr 28, 2016

Either unused or useless:

    osdmap->mkfs_epoch
    osd->o_marked_for_keepalive
    monc->num_generic_requests
    osdc->map_waiters
    osdc->last_requested_map
    osdc->timeout_tid

    osd_req_op_cls_response_data()

    osdmap_apply_incremental() @msgr arg
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

0c0a8de1

libceph: variable-sized ceph_object_id · d30291b9

Authored by Ilya Dryomov on Apr 29, 2016

Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
except one - long rbd image names:

- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"

We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh.  (A format 2 header name is
a small system-generated string.)

Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string.  Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
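
The idea of an object id that either stores a short name inline or points to an externally allocated string can be sketched in userspace C as follows. This is a simplified illustration under stated assumptions: the type `object_id`, its helpers, and the inline capacity of 52 bytes are chosen for the example (a pointer, a 52-byte buffer and an int add up to 64 bytes on a typical 64-bit target) and are not claimed to match the kernel's definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define OID_INLINE_LEN 52   /* assumed inline capacity, including the NUL */

struct object_id {
    char *name;                        /* inline_name or a heap buffer */
    char inline_name[OID_INLINE_LEN];
    int name_len;
};

static void oid_init(struct object_id *oid)
{
    oid->name = oid->inline_name;
    oid->inline_name[0] = '\0';
    oid->name_len = 0;
}

static int oid_is_inline(const struct object_id *oid)
{
    return oid->name == oid->inline_name;
}

/* Free any external buffer and reset to the empty inline state. */
static void oid_destroy(struct object_id *oid)
{
    if (!oid_is_inline(oid))
        free(oid->name);
    oid_init(oid);
}

/*
 * Copy name into oid.  Returns 0 on success, -1 if allocation fails.
 * name must not point into oid itself.
 */
static int oid_set(struct object_id *oid, const char *name)
{
    size_t len;
    char *buf = oid->inline_name;

    assert(oid && name);
    len = strlen(name);

    if (len + 1 > OID_INLINE_LEN) {
        buf = malloc(len + 1);         /* long name: allocate externally */
        if (!buf)
            return -1;
    }
    oid_destroy(oid);                  /* drop the previous name, if any */
    memcpy(buf, name, len + 1);
    oid->name = buf;
    oid->name_len = (int)len;
    return 0;
}
```

Short names never touch the allocator, and only the rare long name pays for a separate buffer.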

d30291b9

05 Feb, 2016 1 commit

crush: decode and initialize chooseleaf_stable · b9b519b7

Authored by Ilya Dryomov on Feb 01, 2016

Also add missing \n while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>

b9b519b7

09 Sep, 2015 1 commit

libceph: set 'exists' flag for newly up osd · 6dd74e44

Authored by Yan, Zheng on Aug 28, 2015

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

6dd74e44

01 Jul, 2015 1 commit

crush: fix a bug in tree bucket decode · 82cd003a

Authored by Ilya Dryomov on Jun 29, 2015

struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe()
should be used.  -Wconversion catches this, but I guess it went
unnoticed in all the noise it spews.  The actual problem (at least for
common crushmaps) isn't the u32 -> u8 truncation though - it's the
advancement by 4 bytes instead of 1 in the crushmap buffer.

Fixes: http://tracker.ceph.com/issues/2759

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
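
The effect described above, where the cursor advances by 4 bytes instead of 1, can be shown with a small cursor-based decoder. This is a sketch under assumptions: `decode_8`, `decode_32`, `tree_header` and the 32-bit `next_field` following the one-byte count are hypothetical stand-ins, not the kernel's decode macros or the real crushmap layout.

```c
#include <assert.h>
#include <stdint.h>

/* Read one byte and advance the cursor by 1. */
static uint8_t decode_8(const uint8_t **p)
{
    uint8_t v = **p;

    *p += 1;
    return v;
}

/* Read a little-endian 32-bit value and advance the cursor by 4. */
static uint32_t decode_32(const uint8_t **p)
{
    const uint8_t *b = *p;
    uint32_t v = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
                 ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);

    *p += 4;
    return v;
}

/* Hypothetical layout: a u8 count followed by a 32-bit field. */
struct tree_header {
    uint8_t num_nodes;
    uint32_t next_field;
};

static struct tree_header decode_tree_header(const uint8_t *buf,
                                             int use_wrong_width)
{
    const uint8_t *p = buf;
    struct tree_header h;

    if (use_wrong_width)
        h.num_nodes = (uint8_t)decode_32(&p); /* truncates, skips 4 bytes */
    else
        h.num_nodes = decode_8(&p);           /* skips exactly 1 byte */

    h.next_field = decode_32(&p);             /* misaligned in the bad case */
    return h;
}
```

With the wrong width the count can still come out right after truncation, but every field decoded after it is read from the wrong offset.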

82cd003a

22 Apr, 2015 1 commit

crush: straw2 bucket type with an efficient 64-bit crush_ln() · 958a2765

Authored by Ilya Dryomov on Apr 14, 2015

This is an improved straw bucket that correctly avoids any data movement
between items A and B when neither A nor B's weights are changed.  Said
differently, if we adjust the weight of item C (including adding it anew
or removing it completely), we will only see inputs move to or from C,
never between other items in the bucket.

Notably, there is no intermediate scaling factor that needs to be
calculated.  The mapping function is a simple function of the item weights.

The below commits were squashed together into this one (mostly to avoid
adding and then yanking ~6000 lines worth of crush_ln_table):

- crush: add a straw2 bucket type
- crush: add crush_ln to calculate nature log efficently
- crush: improve straw2 adjustment slightly
- crush: change crush_ln to provide 32 more digits
- crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
- crush/mapper: fix divide-by-0 in straw2
  (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need
   to create a proper compat.h in ceph.git)

Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53,
                          32a1ead92efcd351822d22a5fc37d159c65c1338,
                          6289912418c4a3597a11778bcf29ed5415117ad9,
                          35fcb04e2945717cf5cfe150b9fa89cb3d2303a1,
                          6445d9ee7290938de1e4ee9563912a6ab6d8ee5f,
                          b5921d55d16796e12d66ad2c4add7305f9ce2353.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
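
The straw2 selection rule named in this commit (draw = ln / w, largest draw wins) can be illustrated with a floating-point sketch. Everything here is an assumption made for the example: `mix_hash` is a stand-in hash rather than the one CRUSH uses, `ln_approx` is a small series approximation replacing the fixed-point crush_ln() and its table, and weights are plain doubles. The point is only that each item's draw depends on nothing but the input, the item and that item's own weight.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in hash (hypothetical): mixes the input x with the item index. */
static uint32_t mix_hash(uint32_t x, uint32_t item)
{
    uint32_t h = (x * 2654435761u) ^ (item + 0x9e3779b9u);

    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Approximate ln(u) for u in (0, 1]: range reduction plus atanh series. */
static double ln_approx(double u)
{
    int k = 0;
    double z, z2, term, sum = 0.0;

    assert(u > 0.0 && u <= 1.0);
    while (u < 0.5) {                  /* bring u into [0.5, 1] */
        u *= 2.0;
        k++;
    }
    z = (u - 1.0) / (u + 1.0);         /* ln(u) = 2 * atanh(z) */
    z2 = z * z;
    term = z;
    for (int n = 1; n <= 19; n += 2) {
        sum += term / n;
        term *= z2;
    }
    return 2.0 * sum - k * 0.6931471805599453;
}

/* Pick the item with the largest ln(u) / weight; -1 if none is eligible. */
static int straw2_choose(uint32_t x, const double *weights, int n)
{
    int best = -1;
    double best_draw = 0.0;

    for (int i = 0; i < n; i++) {
        double u, draw;

        if (weights[i] <= 0.0)
            continue;                  /* zero-weight items never win */
        u = ((double)(mix_hash(x, (uint32_t)i) & 0xffff) + 1.0) / 65536.0;
        draw = ln_approx(u) / weights[i];   /* <= 0, closer to 0 is better */
        if (best < 0 || draw > best_draw) {
            best = i;
            best_draw = draw;
        }
    }
    return best;
}
```

Because the draws of items A and B are unaffected by item C's weight, raising C's weight can only move inputs to C, never between A and B.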

958a2765

15 Oct, 2014 2 commits

libceph: Convert pr_warning to pr_warn · b9a67899

Authored by Joe Perches on Sep 09, 2014

Use the more common pr_warn.

Other miscellanea:

o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

b9a67899

libceph: fix a use after free issue in osdmap_set_max_osd · 589506f1

Authored by Li RongQing on Sep 07, 2014

If the state array is krealloc'ed successfully, the old map->osd_state
buffer is freed.  If one of the following two reallocations then fails,
the function exits without updating map->osd_state, leaving it a
dangling pointer.

Fix it by updating each pointer as soon as its krealloc succeeds.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
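
The pattern behind this fix can be sketched in userspace with realloc() standing in for krealloc(). The struct `osd_arrays`, the function `set_max` and the `allocs_left` failure hook are hypothetical names introduced for the example; the sketch only grows two arrays, which is enough to show why each pointer has to be stored as soon as its reallocation succeeds.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct osd_arrays {
    uint8_t *state;
    uint32_t *weight;
    int max;
};

/* Test hook: how many reallocations may still succeed (-1 = unlimited). */
static int allocs_left = -1;

static void *try_realloc(void *p, size_t size)
{
    if (allocs_left == 0)
        return NULL;                   /* simulated allocation failure */
    if (allocs_left > 0)
        allocs_left--;
    return realloc(p, size);
}

/* Resize both arrays to max entries.  Returns 0 on success, -1 on failure. */
static int set_max(struct osd_arrays *m, int max)
{
    uint8_t *state;
    uint32_t *weight;

    assert(m && max > 0);

    state = try_realloc(m->state, (size_t)max * sizeof(*state));
    if (!state)
        return -1;
    m->state = state;      /* store now: the old buffer may already be freed */

    weight = try_realloc(m->weight, (size_t)max * sizeof(*weight));
    if (!weight)
        return -1;         /* m->state is still valid at this point */
    m->weight = weight;

    m->max = max;
    return 0;
}
```

If the assignment to m->state were deferred until both reallocations had succeeded, the early return would leave m->state pointing at memory that realloc() may have released.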

589506f1

17 May, 2014 1 commit

crush: decode and initialize chooseleaf_vary_r · f140662f

Authored by Ilya Dryomov on May 09, 2014

Commit e2b149cc ("crush: add chooseleaf_vary_r tunable") added the
crush_map::chooseleaf_vary_r field but missed the decode part.  This
led to misdirected requests caused by incorrect raw crush mapping
sets.

Fixes: http://tracker.ceph.com/issues/8226

Reported-and-Tested-by: Dmitry Smirnov <onlyjob@member.fsf.org>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

f140662f

29 Apr, 2014 1 commit

libceph: fix non-default values check in apply_primary_affinity() · 92b2e751

Authored by Ilya Dryomov on Apr 10, 2014

The osd_primary_affinity array is indexed incorrectly when checking
for non-default primary-affinity values.  This nullifies the effect of
the rest of apply_primary_affinity() and results in misdirected
requests.

                if (osds[i] != CRUSH_ITEM_NONE &&
                    osdmap->osd_primary_affinity[i] !=
                                                ^^^
                                        CEPH_OSD_DEFAULT_PRIMARY_AFFINITY) {

For a pool with size 2, this always ends up checking osd0 and osd1
primary_affinity values, instead of the values that correspond to the
osds in question.  E.g., given a [2,3] up set and a [max,max,0,max]
primary affinity vector, requests are still sent to osd2, because both
osd0 and osd1 happen to have max primary_affinity values and therefore
we return from apply_primary_affinity() early on the premise that all
osds in the given set have max (default) values.  Fix it.

Fixes: http://tracker.ceph.com/issues/7954

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
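
The indexing mistake and its fix can be compared side by side in a short sketch. The function names and the macros `ITEM_NONE` and `DEFAULT_AFFINITY` are stand-ins introduced for this example (playing the roles of CRUSH_ITEM_NONE and CEPH_OSD_DEFAULT_PRIMARY_AFFINITY); only the early "is any value non-default?" check is modelled, not the rest of apply_primary_affinity().

```c
#include <assert.h>
#include <stdint.h>

#define ITEM_NONE        0x7fffffff  /* stand-in for CRUSH_ITEM_NONE */
#define DEFAULT_AFFINITY 0x10000u    /* stand-in for the default (max) value */

/* Buggy check: indexes the affinity array by position in the set. */
static int has_non_default_buggy(const int *osds, int len,
                                 const uint32_t *affinity)
{
    assert(osds && affinity);
    for (int i = 0; i < len; i++) {
        if (osds[i] != ITEM_NONE && affinity[i] != DEFAULT_AFFINITY)
            return 1;
    }
    return 0;
}

/* Fixed check: indexes the affinity array by OSD id. */
static int has_non_default_fixed(const int *osds, int len,
                                 const uint32_t *affinity)
{
    assert(osds && affinity);
    for (int i = 0; i < len; i++) {
        if (osds[i] != ITEM_NONE &&
            affinity[osds[i]] != DEFAULT_AFFINITY)
            return 1;
    }
    return 0;
}
```

For the [2,3] up set and [max,max,0,max] affinity vector from the commit message, the buggy check looks at osd0 and osd1 and reports nothing unusual, while the fixed check sees osd2's zero affinity.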

92b2e751

05 Apr, 2014 18 commits

libceph: output primary affinity values on osdmap updates · f31da0f3

Authored by Ilya Dryomov on Apr 02, 2014

Similar to osd weights, output primary affinity values on incremental
osdmap updates.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

f31da0f3

libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() · c4c12285

Authored by Ilya Dryomov on Mar 24, 2014

Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
and get rid of the now unused calc_pg_raw().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

c4c12285

libceph: add support for osd primary affinity · 47ec1f3c

Authored by Ilya Dryomov on Mar 24, 2014

Respond to non-default primary_affinity values accordingly.  (Primary
affinity allows the admin to shift 'primary responsibility' away from
specific osds, effectively shifting around the read side of the
workload and whatever overhead is incurred by peering and writes by
virtue of being the primary).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

47ec1f3c

libceph: add support for primary_temp mappings · 5e8d4d36

Authored by Ilya Dryomov on Mar 24, 2014

Change apply_temps() to override the primary in the same way pg_temp
overrides the osd set.  primary_temp overrides the pg_temp primary too.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

5e8d4d36

libceph: return primary from ceph_calc_pg_acting() · 8008ab10

Authored by Ilya Dryomov on Mar 24, 2014

In preparation for adding support for primary_temp, stop assuming
primaryness: add a primary out parameter to ceph_calc_pg_acting() and
change call sites accordingly.  Primary is now specified separately
from the order of osds in the set.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

8008ab10

libceph: switch ceph_calc_pg_acting() to new helpers · ac972230

Authored by Ilya Dryomov on Mar 24, 2014

Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(),
raw_to_up_osds() and apply_temps().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

ac972230

libceph: introduce apply_temps() helper · 45966c34

Authored by Ilya Dryomov on Mar 24, 2014

apply_temps() helper for applying various temporary mappings (at this
point only pg_temp mappings) to the up set, therefore transforming it
into an acting set.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

45966c34

libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers · 2bd93d4d

Authored by Ilya Dryomov on Mar 24, 2014

pg_to_raw_osds() helper for computing a raw (crush) set, which can
contain non-existent and down osds.

raw_to_up_osds() helper for pruning non-existent and down osds from the
raw set, therefore transforming it into an up set, and determining up
primary.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
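
The raw-to-up transformation described above can be sketched as an in-place filter. This is a simplified illustration, not the kernel helper: the function name `raw_to_up`, its signature, and the `is_up` lookup array are assumptions, and it models only the case where surviving osds are shifted left and the first one becomes the up primary.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Drop osds that are out of range (treated as non-existent) or not up.
 * Survivors are shifted left in place.  Returns the new length and stores
 * the first surviving osd in *primary, or -1 if the up set is empty.
 */
static int raw_to_up(int *osds, int len, const bool *is_up, int max_osd,
                     int *primary)
{
    int kept = 0;

    assert(osds && is_up && primary);

    for (int i = 0; i < len; i++) {
        int osd = osds[i];

        if (osd < 0 || osd >= max_osd || !is_up[osd])
            continue;                  /* prune non-existent or down osd */
        osds[kept++] = osd;
    }
    *primary = kept > 0 ? osds[0] : -1;
    return kept;
}
```

For example, a raw set [4,1,6] with osd4 down becomes the up set [1,6] with osd1 as up primary.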

2bd93d4d

libceph: primary_affinity decode bits · 63a6993f