Commits · a90c6ac2b5651b1f907de512c2fa648c9fa6bb6e · openeuler / raspberrypi-kernel

17 Jul, 2017 · 3 commits

libceph: use alloc_pg_mapping() in __decode_pg_upmap_items() · f5cc6898

Authored by Ilya Dryomov on Jul 07, 2017

... otherwise we die in insert_pg_mapping(), which wants pg->node to be
empty, i.e. initialized with RB_CLEAR_NODE.

Fixes: 6f428df4 ("libceph: pg_upmap[_items] infrastructure")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: set -EINVAL in one place in crush_decode() · c2acfd95

Authored by Ilya Dryomov on Jul 13, 2017

No sooner than Dan had fixed this issue in commit 293dffaa
("libceph: NULL deref on crush_decode() error path"), I brought it
back.  Add a new label and set -EINVAL once, right before failing.

Fixes: 278b1d70 ("libceph: ceph_decode_skip_* helpers")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: NULL deref on osdmap_apply_incremental() error path · 00c8ebb3

Authored by Dan Carpenter on Jul 13, 2017

There are hidden gotos in the ceph_decode_* macros.  We need to set the
"err" variable on these error paths otherwise we end up returning
ERR_PTR(0) which is NULL.  It causes NULL dereferences in the callers.
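The failure mode can be reproduced in userspace. The following is a minimal sketch, not the kernel code: `demo_err_ptr()` and `demo_is_err()` only mimic the kernel's ERR_PTR()/IS_ERR() helpers, and `demo_decode_need()` is a hypothetical macro standing in for the ceph_decode_* macros with their hidden goto.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Userspace stand-ins that mimic the kernel's ERR_PTR()/IS_ERR(). */
static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical decode macro: jumps to `bad` if fewer than n bytes remain. */
#define demo_decode_need(p, end, n, bad)			\
	do {							\
		if ((size_t)((end) - (p)) < (size_t)(n))	\
			goto bad;				\
	} while (0)

static int demo_map;	/* stands in for a successfully decoded map */

/* Buggy: err is still 0 when the hidden goto fires, so NULL is returned. */
void *demo_decode_buggy(const char *p, const char *end)
{
	int err = 0;

	assert(p && end && p <= end);
	demo_decode_need(p, end, 4, bad);
	return &demo_map;
bad:
	return demo_err_ptr(err);
}

/* Fixed: err is set before any macro that may jump to the error label. */
void *demo_decode_fixed(const char *p, const char *end)
{
	int err;

	assert(p && end && p <= end);
	err = -EINVAL;
	demo_decode_need(p, end, 4, bad);
	return &demo_map;
bad:
	return demo_err_ptr(err);
}
```

A caller that only checks `demo_is_err()` treats the NULL from the buggy variant as success and dereferences it, which is the crash described here.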

Fixes: 6f428df4 ("libceph: pg_upmap[_items] infrastructure")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[idryomov@gmail.com: similar bug in osdmap_decode(), changelog tweak]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

07 Jul, 2017 · 16 commits

libceph: osd_state is 32 bits wide in luminous · 0bb05da2

Authored by Ilya Dryomov on Jun 22, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph, crush: per-pool crush_choose_arg_map for crush_do_rule() · 5cf9c4a9

Authored by Ilya Dryomov on Jun 22, 2017

If there is no crush_choose_arg_map for a given pool, a NULL pointer is
passed to preserve existing crush_do_rule() behavior.

Reflects ceph.git commits 55fb91d64071552ea1bc65ab4ea84d3c8b73ab4b,
dbe36e08be00c6519a8c89718dd47b0219c20516.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

crush: implement weight and id overrides for straw2 · 069f3222

Authored by Ilya Dryomov on Jun 22, 2017

bucket_straw2_choose needs to use weights that may be different from
item_weights, for instance to compensate for an uneven distribution
caused by a low number of values, or to fix the probability bias
introduced by conditional probabilities (see
http://tracker.ceph.com/issues/15653 for more information).

We introduce a weight_set for each straw2 bucket to set the desired
weight for a given item at a given position. The weight of a given item
when picking the first replica (first position) may be different from
its weight when picking the second replica (second position). For
instance, the weight matrix for a given bucket containing items 3, 7 and
13 could be as follows:

          position 0   position 1

item 3     0x10000      0x100000
item 7     0x40000       0x10000
item 13    0x40000       0x10000

When crush_do_rule picks the first of two replicas (position 0), items 7
and 13 are each four times more likely to be chosen by
bucket_straw2_choose than item 3. When choosing the second replica
(position 1), item 3 is sixteen times more likely to be chosen than
item 7 or 13.

By default the weight_set of each bucket exactly matches the content of
item_weights for each position to ensure backward compatibility.

bucket_straw2_choose compares items by using their id. The same ids are
also used to index buckets and they must be unique. For each item in a
bucket an array of ids can be provided for placement purposes and they
are used instead of the ids. If no replacement ids are provided, the
legacy behavior is preserved.
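The per-position weight lookup can be sketched as follows. This is a simplified illustration, not the kernel's data layout: `struct demo_bucket`, `struct demo_weight_set` and `demo_weights_for()` are hypothetical names, and clamping an out-of-range position to the last row is an assumption of this sketch. The matrix holds the example weights for items 3, 7 and 13.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ITEMS 3

/* Hypothetical, simplified stand-in for a straw2 bucket. */
struct demo_bucket {
	int32_t items[DEMO_ITEMS];		/* item ids */
	uint32_t item_weights[DEMO_ITEMS];	/* legacy weights, 16.16 fixed point */
};

/* Hypothetical per-position override: one row of weights per position. */
struct demo_weight_set {
	const uint32_t (*weights)[DEMO_ITEMS];
	size_t positions;
};

/* Example matrix from the text; columns are items 3, 7 and 13. */
const uint32_t demo_table[2][DEMO_ITEMS] = {
	{ 0x10000,  0x40000, 0x40000 },	/* position 0 */
	{ 0x100000, 0x10000, 0x10000 },	/* position 1 */
};

/*
 * Pick the weights to use for a replica position.  Without a weight set
 * the legacy item_weights are returned, preserving the old behavior.
 * Clamping to the last row is an assumption made for this sketch.
 */
const uint32_t *demo_weights_for(const struct demo_bucket *b,
				 const struct demo_weight_set *ws,
				 size_t position)
{
	assert(b != NULL);
	if (!ws || ws->positions == 0)
		return b->item_weights;
	if (position >= ws->positions)
		position = ws->positions - 1;
	return ws->weights[position];
}
```

Passing a NULL weight set returns the bucket's own item_weights, which corresponds to the backward-compatible default described here.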

Reflects ceph.git commit 19537a450fd5c5a0bb8b7830947507a76db2ceca.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: apply_upmap() · 1c2e7b45

Authored by Ilya Dryomov on Jun 21, 2017

Previously, pg_to_raw_osds() didn't filter for existent OSDs because
raw_to_up_osds() would filter for "up" ("up" is predicated on "exists")
and raw_to_up_osds() was called directly after pg_to_raw_osds().  Now,
with apply_upmap() call in there, nonexistent OSDs in pg_to_raw_osds()
output can affect apply_upmap().  Introduce remove_nonexistent_osds()
to deal with that.
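A filter of this kind might look like the sketch below. It is an illustration under assumptions, not the kernel function: `demo_remove_nonexistent()` is a hypothetical name, existence is modeled as a plain boolean array indexed by OSD id, and `DEMO_ITEM_NONE` is a placeholder for an empty slot. The two modes (shift entries left, or keep positions and mark the slot empty) are assumed by analogy with the shift-or-placeholder handling used elsewhere in the mapping code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ITEM_NONE 0x7fffffff	/* placeholder for an empty slot */

/*
 * Drop OSDs that do not exist from osds[0..n).  If can_shift is true the
 * remaining entries are compacted to the left; otherwise positions are
 * preserved and nonexistent entries become DEMO_ITEM_NONE.
 * Returns the new number of entries.
 */
size_t demo_remove_nonexistent(int *osds, size_t n,
			       const bool *exists, size_t nr_osds,
			       bool can_shift)
{
	size_t i, kept = 0;

	assert(osds && exists);
	for (i = 0; i < n; i++) {
		bool ok = osds[i] >= 0 && (size_t)osds[i] < nr_osds &&
			  exists[osds[i]];

		if (ok)
			osds[kept++] = osds[i];
		else if (!can_shift)
			osds[kept++] = DEMO_ITEM_NONE;
		/* else: skip the entry, later ones shift left */
	}
	return kept;
}
```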
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: compute actual pgid in ceph_pg_to_up_acting_osds() · 463bb8da

Authored by Ilya Dryomov on Jun 21, 2017

Move raw_pg_to_pg() call out of get_temp_osds() and into
ceph_pg_to_up_acting_osds(), for upcoming apply_upmap().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: pg_upmap[_items] infrastructure · 6f428df4

Authored by Ilya Dryomov on Jun 21, 2017

pg_temp and pg_upmap encodings are the same (PG -> array of osds),
except for the incremental remove: it's an empty mapping in new_pg_temp
for pg_temp and a separate old_pg_upmap set for pg_upmap.  (This isn't
to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't
looked at as an example for pg_upmap encoding.)

Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap.
__decode_pg_temp() stores into pg_temp union member, but since pg_upmap
union member is identical, reading through pg_upmap later is OK.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: ceph_decode_skip_* helpers · 278b1d70

Authored by Ilya Dryomov on Jun 21, 2017

Some of these won't be as efficient as they could be (e.g.
ceph_decode_skip_set(... 32 ...) could advance by len * sizeof(u32)
once instead of advancing by sizeof(u32) len times), but that's fine
and not worth a bunch of extra macro code.

Replace skip_name_map() with ceph_decode_skip_map as an example.
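The efficiency trade-off mentioned above can be illustrated with two hypothetical skip routines for a length-prefixed set of 32-bit values. This is a sketch with plain functions rather than the actual macros; all `demo_*` names are invented for the example, and the only assumption about the encoding is a little-endian 32-bit length followed by that many 32-bit elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value and advance *p; -1 if out of space. */
static int demo_get_le32(const uint8_t **p, const uint8_t *end, uint32_t *v)
{
	if ((size_t)(end - *p) < 4)
		return -1;
	*v = (uint32_t)(*p)[0] | (uint32_t)(*p)[1] << 8 |
	     (uint32_t)(*p)[2] << 16 | (uint32_t)(*p)[3] << 24;
	*p += 4;
	return 0;
}

/* Skip element by element: len bounds checks, len small advances. */
int demo_skip_set32_each(const uint8_t **p, const uint8_t *end)
{
	uint32_t len;

	assert(p && *p && end);
	if (demo_get_le32(p, end, &len))
		return -1;
	while (len--) {
		if ((size_t)(end - *p) < sizeof(uint32_t))
			return -1;
		*p += sizeof(uint32_t);
	}
	return 0;
}

/* Skip in bulk: one bounds check, one advance by len * sizeof(u32). */
int demo_skip_set32_bulk(const uint8_t **p, const uint8_t *end)
{
	uint32_t len;

	assert(p && *p && end);
	if (demo_get_le32(p, end, &len))
		return -1;
	if ((size_t)(end - *p) / sizeof(uint32_t) < len)
		return -1;
	*p += (size_t)len * sizeof(uint32_t);
	return 0;
}
```

Both variants leave the cursor at the same position on success; the element-wise form is simply easier to generate generically for nested types.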
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: kill __{insert,lookup,remove}_pg_mapping() · ab75144b

Authored by Ilya Dryomov on Jun 21, 2017

Switch to DEFINE_RB_FUNCS2-generated {insert,lookup,erase}_pg_mapping().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: introduce and switch to decode_pg_mapping() · a303bb0e

Authored by Ilya Dryomov on Jun 21, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: don't pass pgid by value · 33333d10

Authored by Ilya Dryomov on Jun 21, 2017

Make __{lookup,remove}_pg_mapping() look like their ceph_spg_mapping
counterparts: take const struct ceph_pg *.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: respect RADOS_BACKOFF backoffs · a02a946d

Authored by Ilya Dryomov on Jun 19, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: avoid unnecessary pi lookups in calc_target() · df28152d

Authored by Ilya Dryomov on Jun 15, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: resend on PG splits if OSD has RESEND_ON_SPLIT · 7de030d6

Authored by Ilya Dryomov on Jun 15, 2017

Note that ceph_osd_request_target fields are updated regardless of
RESEND_ON_SPLIT.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: introduce ceph_spg, ceph_pg_to_primary_shard() · dc98ff72

Authored by Ilya Dryomov on Jun 15, 2017

Store both raw pgid and actual spgid in ceph_osd_request_target.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: new pi->last_force_request_resend · 8e48cf00

Authored by Ilya Dryomov on Jun 05, 2017

The old (v15) pi->last_force_request_resend has been repurposed to
make pre-RESEND_ON_SPLIT clients that don't check for PG splits but do
obey pi->last_force_request_resend resend on splits.  See ceph.git
commit 189ca7ec6420 ("mon/OSDMonitor: make pre-luminous clients resend
ops on split").
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: handle non-empty dest in ceph_{oloc,oid}_copy() · ca35ffea

Authored by Ilya Dryomov on Jun 05, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

24 May, 2017 · 1 commit

libceph: NULL deref on crush_decode() error path · 293dffaa

Authored by Dan Carpenter on May 23, 2017

If there is not enough space then ceph_decode_32_safe() does a goto bad.
We need to return an error code in that situation.  The current code
returns ERR_PTR(0) which is NULL.  The callers are not expecting that
and it results in a NULL dereference.

Fixes: f24e9980 ("ceph: OSD client")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

07 Mar, 2017 · 2 commits

libceph: don't set weight to IN when OSD is destroyed · b581a585

Authored by Ilya Dryomov on Mar 01, 2017

Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs. Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.

Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.

Fixes: 930c5328 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>

libceph: fix crush_decode() for older maps · 9afd30db

Authored by Ilya Dryomov on Feb 28, 2017

Older (shorter) CRUSH maps too need to be finalized.

Fixes: 66a0e2d5 ("crush: remove mutable part of CRUSH map")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

20 Feb, 2017 · 4 commits

libceph: don't go through with the mapping if the PG is too wide · ef9324bb

Authored by Ilya Dryomov on Feb 08, 2017

With EC overwrites maturing, the kernel client will be getting exposed
to potentially very wide EC pools. While "min(pi->size, X)" works fine
when the cluster is stable and happy, truncating OSD sets interferes
with resend logic (ceph_is_new_interval(), etc). Abort the mapping if
the pool is too wide, assigning the request to the homeless session.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>

crush: merge working data and scratch · 743efcff

Authored by Ilya Dryomov on Jan 31, 2017

Much like Arlo Guthrie, I decided that one big pile is better than two
little piles.

Reflects ceph.git commit 95c2df6c7e0b22d2ea9d91db500cf8b9441c73ba.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

crush: remove mutable part of CRUSH map · 66a0e2d5

Authored by Ilya Dryomov on Jan 31, 2017

Then add it to the working state. It would be very nice if we didn't
have to take a lock to calculate a crush placement. By moving the
permutation array into the working data, we can treat the CRUSH map as
immutable.

Reflects ceph.git commit cbcd039651c0569551cb90d26ce27e1432671f2a.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: add osdmap_set_crush() helper · 1b6a78b5

Authored by Ilya Dryomov on Jan 31, 2017

Simplify osdmap_decode() and osdmap_apply_incremental() a bit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

28 Jul, 2016 · 2 commits

libceph: rados pool namespace support · 30c156d9

Authored by Yan, Zheng on Feb 14, 2016

Add a pool namespace pointer to struct ceph_file_layout and struct
ceph_object_locator. The pool namespace is used when mapping an object
to a PG and when composing an OSD request.

The namespace pointer in struct ceph_file_layout is RCU protected, so
libceph can read the namespace without taking a lock.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: define new ceph_file_layout structure · 7627151e

Authored by Yan, Zheng on Feb 03, 2016

Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

22 Jul, 2016 · 1 commit

libceph: apply new_state before new_up_client on incrementals · 930c5328

Authored by Ilya Dryomov on Jul 19, 2016

Currently, osd_weight and osd_state fields are updated in the encoding
order.  This is wrong, because an incremental map may look like e.g.

    new_up_client: { osd=6, addr=... } # set osd_state and addr
    new_state: { osd=6, xorstate=EXISTS } # clear osd_state

Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state.  A non-existent OSD is considered down by the
mapping code

    for (i = 0; i < pg->pg_temp.len; i++) {
            if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
                    if (ceph_can_shift_osds(pi))
                            continue;

                    temp->osds[temp->size++] = CRUSH_ITEM_NONE;

and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:

[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680

and hung rbds on the client:

[  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[  493.566805] rbd: rbd0:   result -6 xferred 400000
[  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688

The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed
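The ordering problem for osd6 can be sketched with two flag bits. This is a simplified model, not the kernel implementation: the `DEMO_OSD_*` flags and `demo_*` functions are hypothetical, and only the "new_state before new_up_client" part of the fix is modeled (weights and the destroyed-OSD cleanup are left out).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch. */
#define DEMO_OSD_EXISTS (1u << 0)
#define DEMO_OSD_UP     (1u << 1)

/* new_up_client: the OSD is marked as existing and up. */
static uint32_t demo_apply_up_client(uint32_t state)
{
	return state | DEMO_OSD_EXISTS | DEMO_OSD_UP;
}

/* new_state: the given bits are flipped (xor). */
static uint32_t demo_apply_new_state(uint32_t state, uint32_t xorstate)
{
	/* this sketch only models the two flags above */
	assert((xorstate & ~(DEMO_OSD_EXISTS | DEMO_OSD_UP)) == 0);
	return state ^ xorstate;
}

/* Old behavior: apply in encoding order, new_up_client first. */
uint32_t demo_encoding_order(uint32_t state, uint32_t xorstate)
{
	state = demo_apply_up_client(state);
	return demo_apply_new_state(state, xorstate);
}

/* Fixed behavior: apply new_state first, then new_up_client. */
uint32_t demo_fixed_order(uint32_t state, uint32_t xorstate)
{
	state = demo_apply_new_state(state, xorstate);
	return demo_apply_up_client(state);
}
```

Starting from EXISTS with xorstate=EXISTS, the encoding order ends in UP without EXISTS, while the fixed order ends in EXISTS | UP.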

Fixes: http://tracker.ceph.com/issues/14901

Cc: stable@vger.kernel.org # 3.15+: 6dd74e44: libceph: set 'exists' flag for newly up osd
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>

31 May, 2016 · 1 commit

libceph: use %s instead of %pE in dout()s · 4a3262b1

Authored by Ilya Dryomov on May 30, 2016

Commit d30291b9 ("libceph: variable-sized ceph_object_id") changed
dout()s in what is now encode_request() and ceph_object_locator_to_pg()
to use %pE, mostly to document that, although all rbd and cephfs object
names are NUL-terminated strings, ceph_object_id will handle any RADOS
object name, including ones containing NULs, just fine.

However, it turns out that vbin_printf() can't handle anything but ints
and %s - all %p suffixes are ignored. The buffer %p** points to isn't
recorded, resulting in trash in the messages if the buffer had been
reused by the time bstr_printf() got to it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

26 May, 2016 · 9 commits

libceph: allocate dummy osdmap in ceph_osdc_init() · e5253a7b