1. 07 Jul, 2017 (1 commit)
    • crush: implement weight and id overrides for straw2 · 069f3222
      Committed by Ilya Dryomov
      bucket_straw2_choose needs to use weights that may be different from
      weight_items. For instance, to compensate for an uneven distribution
      caused by a low number of values, or to fix the probability bias
      introduced by conditional probabilities (see
      http://tracker.ceph.com/issues/15653 for more information).
      
      We introduce a weight_set for each straw2 bucket to set the desired
      weight for a given item at a given position. The weight of a given item
      when picking the first replica (first position) may differ from its
      weight when picking the second replica (second position). For instance,
      the weight matrix for a bucket containing items 3, 7 and 13 could be as
      follows:
      
                position 0   position 1
      
      item 3     0x10000      0x100000
      item 7     0x40000       0x10000
      item 13    0x40000       0x10000
      
      When crush_do_rule picks the first of two replicas (position 0), items 7
      and 13 are four times more likely to be chosen by bucket_straw2_choose
      than item 3. When choosing the second replica (position 1), item 3 is
      sixteen times more likely to be chosen than items 7 and 13.
      
      By default the weight_set of each bucket exactly matches the content of
      item_weights for each position to ensure backward compatibility.
      
      bucket_straw2_choose compares items using their ids. The same ids are
      also used to index buckets, so they must be unique. For placement
      purposes, an array of alternate ids can be provided for the items in a
      bucket; these are then used for hashing instead of the real item ids.
      If no alternate ids are provided, the legacy behavior is preserved.
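      
      As a rough illustration, here is a minimal C sketch of the lookup with
      fallback described above. The struct and names are hypothetical
      simplifications, not the kernel's crush_bucket_straw2 layout:
      
        #include <stdint.h>
        
        /* Hypothetical, simplified straw2 bucket; illustrative only. */
        struct straw2_bucket {
                int size;                /* number of items */
                int *items;              /* real item ids */
                uint32_t *item_weights;  /* legacy 16.16 fixed-point weights */
                uint32_t **weight_set;   /* weight_set[position][item] or NULL */
                int weight_set_size;     /* positions covered by weight_set */
                int *ids;                /* alternate ids for hashing, or NULL */
        };
        
        /* Weight of item i when choosing the replica at 'position';
         * falls back to the legacy per-item weight. */
        uint32_t weight_for(const struct straw2_bucket *b, int i, int position)
        {
                if (b->weight_set && position < b->weight_set_size)
                        return b->weight_set[position][i];
                return b->item_weights[i];
        }
        
        /* Id fed to the straw2 hash for item i; the real id is still
         * what the bucket ultimately returns. */
        int id_for(const struct straw2_bucket *b, int i)
        {
                return b->ids ? b->ids[i] : b->items[i];
        }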
      
      Reflects ceph.git commit 19537a450fd5c5a0bb8b7830947507a76db2ceca.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  2. 24 Feb, 2017 (2 commits)
  3. 20 Feb, 2017 (2 commits)
  4. 13 Dec, 2016 (1 commit)
    • crush: include mapper.h in mapper.c · f6c0d1a3
      Committed by Tobias Klauser
      Include linux/crush/mapper.h in crush/mapper.c to get the prototypes of
      crush_find_rule and crush_do_rule, which are declared there. This fixes
      the following GCC warnings when building with 'W=1':
      
        net/ceph/crush/mapper.c:40:5: warning: no previous prototype for ‘crush_find_rule’ [-Wmissing-prototypes]
        net/ceph/crush/mapper.c:793:5: warning: no previous prototype for ‘crush_do_rule’ [-Wmissing-prototypes]
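      
      The mechanism is generic, not ceph-specific. A minimal sketch with
      hypothetical file names: defining an external-linkage function without
      a previously visible prototype trips -Wmissing-prototypes, and
      including the header both silences the warning and lets the compiler
      check the definition against the declaration:
      
        /* mini_mapper.h — the public declaration */
        int mini_find_rule(int ruleset);
        
        /* mini_mapper.c — defining the function with no previously
         * visible prototype triggers, under gcc -Wmissing-prototypes:
         *   warning: no previous prototype for 'mini_find_rule'
         * Including the header (as mapper.c now does) makes the prototype
         * visible and checks definition against declaration. */
        int mini_find_rule(int ruleset)
        {
                return ruleset >= 0 ? ruleset : -1;
        }
      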
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      [idryomov@gmail.com: corresponding !__KERNEL__ include]
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  5. 06 Oct, 2016 (2 commits)
  6. 05 Feb, 2016 (3 commits)
  7. 25 Jun, 2015 (2 commits)
    • crush: sync up with userspace · b459be73
      Committed by Ilya Dryomov
      .. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff
      between kernel and userspace").  This fixes a bunch of recently pulled
      coding style issues and makes includes a bit cleaner.
      
      A patch "crush:Make the function crush_ln static" from Nicholas Krause
      <xerofoify@gmail.com> is folded in as crush_ln() has been made static
      in userspace as well.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • crush: fix crash from invalid 'take' argument · 8f529795
      Committed by Ilya Dryomov
      Verify that the 'take' argument is a valid device or bucket.
      Otherwise ignore it (do not add the value to the working vector).
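      
      For illustration, a minimal C sketch of such a check, assuming the
      usual CRUSH encoding (a non-negative value names a device, a negative
      value names the bucket stored at index -1 - value); the struct is a
      simplified stand-in for the kernel's crush_map:
      
        #include <stddef.h>
        
        struct crush_map_view {
                int max_devices;   /* devices have ids 0..max_devices-1 */
                int max_buckets;   /* buckets have ids -1..-max_buckets */
                void **buckets;    /* buckets[-1 - id], NULL if absent */
        };
        
        /* A 'take' value is only usable if it names an existing device
         * or a non-NULL bucket; otherwise the step is ignored. */
        int take_arg_is_valid(const struct crush_map_view *m, int item)
        {
                if (item >= 0)
                        return item < m->max_devices;
                return -1 - item < m->max_buckets &&
                       m->buckets[-1 - item] != NULL;
        }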
      
      Reflects ceph.git commit 9324d0a1af61e1c234cc48e2175b4e6320fff8f4.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  8. 22 Apr, 2015 (3 commits)
    • crush: straw2 bucket type with an efficient 64-bit crush_ln() · 958a2765
      Committed by Ilya Dryomov
      This is an improved straw bucket that correctly avoids any data movement
      between items A and B when neither A nor B's weights are changed.  Said
      differently, if we adjust the weight of item C (including adding it anew
      or removing it completely), we will only see inputs move to or from C,
      never between other items in the bucket.
      
      Notably, there is no intermediate scaling factor that needs to be
      calculated.  The mapping function is a simple function of the item weights.
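      
      The core of the selection is easy to sketch. The following C sketch
      uses a toy hash and libm's log() where the kernel uses crush_hash32_3()
      and the fixed-point crush_ln(); every name here is illustrative:
      
        #include <stdint.h>
        #include <math.h>
        
        /* Toy mixing hash, a stand-in for crush_hash32_3(). */
        static uint32_t toy_hash(uint32_t a, uint32_t b, uint32_t c)
        {
                uint32_t h = a * 2654435761u ^ b * 2246822519u ^ c * 3266489917u;
                h ^= h >> 15;
                return h * 2654435761u;
        }
        
        /* Each item draws ln(uniform)/weight; the largest draw wins.
         * Changing one item's weight only moves inputs to or from that
         * item, never between the others. */
        int straw2_select(const int *items, const double *weights, int n,
                          uint32_t x, uint32_t r)
        {
                int best = -1;
                double best_draw = -INFINITY;
        
                for (int i = 0; i < n; i++) {
                        if (weights[i] <= 0)
                                continue;  /* zero weight can never win */
                        /* uniform in (0, 1]; +1 avoids log(0) */
                        double u = ((toy_hash(x, (uint32_t)items[i], r) & 0xffff) + 1)
                                   / 65536.0;
                        double draw = log(u) / weights[i];
                        if (draw > best_draw) {
                                best_draw = draw;
                                best = items[i];
                        }
                }
                return best;
        }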
      
      The below commits were squashed together into this one (mostly to avoid
      adding and then yanking ~6000 lines' worth of crush_ln_table):
      
      - crush: add a straw2 bucket type
      - crush: add crush_ln to calculate nature log efficently
      - crush: improve straw2 adjustment slightly
      - crush: change crush_ln to provide 32 more digits
      - crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
      - crush/mapper: fix divide-by-0 in straw2
        (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need
         to create a proper compat.h in ceph.git)
      
      Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53,
                                32a1ead92efcd351822d22a5fc37d159c65c1338,
                                6289912418c4a3597a11778bcf29ed5415117ad9,
                                35fcb04e2945717cf5cfe150b9fa89cb3d2303a1,
                                6445d9ee7290938de1e4ee9563912a6ab6d8ee5f,
                                b5921d55d16796e12d66ad2c4add7305f9ce2353.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • crush: ensuring at most num-rep osds are selected · 45002267
      Committed by Ilya Dryomov
      CRUSH temporary buffers are allocated according to the replica count
      configured by the user.  When the rule selects more final osds than
      there are replicas, the buffer is overrun and causes a crash.  Now at
      most num-rep osds are selected, even if the rule would allow more.
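      
      A minimal C sketch of the invariant the fix enforces (illustrative
      names, not the kernel diff): the output buffer holds at most
      result_max entries, so each choose step clamps what it emits:
      
        /* Emit up to 'want' candidates without ever overrunning
         * result[], which the caller sized for result_max entries.
         * Returns the new fill level, always <= result_max. */
        int choose_step(int want, int result_max, int osize,
                        const int *candidates, int ncand, int *result)
        {
                int numrep = want;
        
                if (numrep > result_max - osize)
                        numrep = result_max - osize;
        
                for (int i = 0; i < numrep && i < ncand; i++)
                        result[osize++] = candidates[i];
                return osize;
        }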
      
      Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0,
                                234b066ba04976783d15ff2abc3e81b6cc06fb10.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • crush: drop unnecessary include from mapper.c · 9be6df21
      Committed by Ilya Dryomov
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  9. 05 Apr, 2014 (4 commits)
    • crush: add SET_CHOOSELEAF_VARY_R step · d83ed858
      Committed by Ilya Dryomov
      This lets you adjust the vary_r tunable on a per-rule basis.
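      
      A hedged C sketch of how such a per-rule "set" step can be interpreted
      while executing a rule (names are illustrative, not the kernel's): the
      step overrides the map-wide default for the rest of the rule, with
      negative arguments ignored:
      
        /* Start from the map-wide tunable; a SET_CHOOSELEAF_VARY_R step
         * with a non-negative argument overrides it for this rule. */
        int effective_vary_r(int map_vary_r, int step_seen, int step_arg)
        {
                if (step_seen && step_arg >= 0)
                        return step_arg;
                return map_vary_r;
        }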
      
      Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • crush: add chooseleaf_vary_r tunable · e2b149cc
      Committed by Ilya Dryomov
      The current crush_choose_firstn code will re-use the same 'r' value for
      the recursive call.  That means that if we are hitting a collision or
      rejection for some reason (say, an OSD that is marked out) and need to
      retry, we will keep making the same (bad) choice in that recursive
      selection.
      
      Introduce a tunable that fixes that behavior by incorporating the parent
      'r' value into the recursive starting point, so that a different path
      will be taken in subsequent placement attempts.
      
      Note that this was done from the get-go for the new crush_choose_indep
      algorithm.
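      
      A minimal C sketch of the idea (not the kernel's crush_choose_firstn;
      the shift-based scheme below is how the tunable is commonly described
      and is taken here as an assumption): without vary_r the recursive
      descent always starts from r' = 0, so every outer retry re-walks the
      same inner path; folding the parent's r in makes retries diverge:
      
        /* Starting r' for the recursive (chooseleaf) descent. */
        unsigned int inner_start_r(unsigned int parent_r, int vary_r)
        {
                if (!vary_r)
                        return 0;                   /* legacy behavior */
                return parent_r >> (vary_r - 1);    /* vary_r=1: use parent r */
        }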
      
      This was exposed by a user who was seeing PGs stuck in active+remapped
      after reweight-by-utilization because the up set mapped to a single OSD.
      
      Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • crush: allow crush rules to set (re)tries counts to 0 · 6ed1002f
      Committed by Ilya Dryomov
      These two fields are misnomers; they are *retry* counts.
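      
      A small C sketch of the validation change this implies (an assumed
      shape, not the actual diff): accepting 0 as a legitimate override
      while still treating negative arguments as "unset":
      
        /* Apply a per-rule "set tries" step.  Testing arg >= 0 (rather
         * than arg > 0) lets a rule request zero retries explicitly. */
        void set_retries(int arg, int *retries)
        {
                if (arg >= 0)
                        *retries = arg;
        }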
      
      Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • crush: fix off-by-one errors in total_tries refactor · 48a163db
      Committed by Ilya Dryomov
      Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
      code to allow adjustment of the retry counts on a per-pool basis.  That
      commit had an off-by-one bug: the previous "tries" counter was a *retry*
      count, not a *try* count, but the new code was passing in 1 meaning
      there should be no retries.
      
      Change the ftotal vs tries comparison to use < instead of <= to fix the
      problem.  Note that the original code used <= here, which means the
      global "choose_total_tries" tunable is actually counting retries.
      Compensate for that by adding 1 in crush_do_rule when we pull the tunable
      into the local variable.
      
      This was noticed looking at output from a user-provided osdmap.
      Unfortunately the map doesn't illustrate the change in mapping behavior
      and I haven't managed to construct one yet that does.  Inspection of the
      crush debug output now aligns with prior versions, though.
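      
      A minimal C sketch of the two halves of the fix as described above
      (illustrative names, not the actual diff):
      
        /* A new attempt is allowed while the number of failures so far
         * is strictly below the try budget ('<' rather than '<='). */
        int may_try_again(unsigned int ftotal, unsigned int tries)
        {
                return ftotal < tries;
        }
        
        /* The legacy map-wide tunable counted retries, so convert it to
         * a try count once, when pulling it into the local variable. */
        unsigned int try_budget(unsigned int choose_total_tries)
        {
                return choose_total_tries + 1;
        }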
      
      Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
  10. 01 Jan, 2014 (18 commits)
  11. 18 Jan, 2013 (2 commits)
    • crush: avoid recursion if we have already collided · 7d7c1f61
      Committed by Sage Weil
      This saves us some cycles, but does not affect the placement result at
      all.
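      
      In other words, the collision test is cheap and its failure forces a
      retry regardless of what the recursion would find, so it pays to run
      it first.  A hedged C sketch of that ordering (illustrative names):
      
        /* Has 'item' already been chosen in an earlier slot? */
        int collides(const int *out, int outpos, int item)
        {
                for (int i = 0; i < outpos; i++)
                        if (out[i] == item)
                                return 1;
                return 0;
        }
        
        /* In the choose loop, only descend to a leaf for candidates
         * that are not already collisions:
         *
         *     if (!collides(out, outpos, item) && recurse_to_leaf)
         *             reject = !descend_to_leaf(map, item, ...);
         */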
      
      This corresponds to ceph.git commit 4abb53d4f.
      Signed-off-by: Sage Weil <sage@inktank.com>
    • libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed · 1604f488
      Committed by Jim Schutt
      Add libceph support for a new CRUSH tunable recently added to Ceph servers.
      
      Consider the CRUSH rule
        step chooseleaf firstn 0 type <node_type>
      
      This rule means that <n> replicas will be chosen in a manner such that
      each chosen leaf's branch will contain a unique instance of <node_type>.
      
      When an object is re-replicated after a leaf failure, if the CRUSH map uses
      a chooseleaf rule the remapped replica ends up under the <node_type> bucket
      that held the failed leaf.  This causes uneven data distribution across the
      storage cluster, to the point that when all the leaves but one fail under a
      particular <node_type> bucket, that remaining leaf holds all the data from
      its failed peers.
      
      This behavior also limits the number of peers that can participate in the
      re-replication of the data held by the failed leaf, which increases the
      time required to re-replicate after a failure.
      
      For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
      inner and outer descents.
      
      If the tree descent down to <node_type> is the outer descent, and the descent
      from <node_type> down to a leaf is the inner descent, the issue is that a
      down leaf is detected on the inner descent, so only the inner descent is
      retried.
      
      In order to disperse re-replicated data as widely as possible across a
      storage cluster after a failure, we want to retry the outer descent. So,
      fix up crush_choose() to allow the inner descent to return immediately on
      choosing a failed leaf.  Wire this up as a new CRUSH tunable.
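      
      A minimal C sketch of the tunable's effect on the retry decision
      (illustrative names and shape, not the kernel diff):
      
        /* Decide how to react when the inner (leaf) descent picked a
         * failed leaf.  With descend_once set, the inner descent is not
         * retried; the rejection propagates so the caller restarts the
         * outer descent from the root, letting the replacement land
         * under a different <node_type> bucket. */
        void on_failed_leaf(int descend_once,
                            int *retry_inner, int *reject_to_outer)
        {
                if (descend_once) {
                        *retry_inner = 0;
                        *reject_to_outer = 1;
                } else {
                        *retry_inner = 1;      /* legacy behavior */
                        *reject_to_outer = 0;
                }
        }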
      
      Note that after this change, for a chooseleaf rule, if the primary OSD
      in a placement group has failed, choosing a replacement may result in
      one of the other OSDs in the PG colliding with the new primary.  That
      OSD's data for the PG then needs to move as well.  This seems
      unavoidable but should be relatively rare.
      
      This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.
      Signed-off-by: Jim Schutt <jaschut@sandia.gov>
      Reviewed-by: Sage Weil <sage@inktank.com>