1. 25 Jun 2015, 1 commit
    • crush: sync up with userspace · b459be73
      Committed by Ilya Dryomov
      .. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff
      between kernel and userspace").  This fixes a bunch of recently pulled
      coding style issues and makes includes a bit cleaner.
      
      A patch "crush:Make the function crush_ln static" from Nicholas Krause
      <xerofoify@gmail.com> is folded in as crush_ln() has been made static
      in userspace as well.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2. 22 Apr 2015, 1 commit
    • crush: straw2 bucket type with an efficient 64-bit crush_ln() · 958a2765
      Committed by Ilya Dryomov
      This is an improved straw bucket that correctly avoids any data movement
      between items A and B when neither A nor B's weights are changed.  Said
      differently, if we adjust the weight of item C (including adding it anew
      or removing it completely), we will only see inputs move to or from C,
      never between other items in the bucket.
      
      Notably, there is no intermediate scaling factor that needs to be
      calculated.  The mapping is a simple function of the item weights.
      
      The commits below were squashed together into this one (mostly to avoid
      adding and then yanking ~6000 lines' worth of crush_ln_table):
      
      - crush: add a straw2 bucket type
      - crush: add crush_ln to calculate nature log efficently
      - crush: improve straw2 adjustment slightly
      - crush: change crush_ln to provide 32 more digits
      - crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
      - crush/mapper: fix divide-by-0 in straw2
        (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need
         to create a proper compat.h in ceph.git)
      
      Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53,
                                32a1ead92efcd351822d22a5fc37d159c65c1338,
                                6289912418c4a3597a11778bcf29ed5415117ad9,
                                35fcb04e2945717cf5cfe150b9fa89cb3d2303a1,
                                6445d9ee7290938de1e4ee9563912a6ab6d8ee5f,
                                b5921d55d16796e12d66ad2c4add7305f9ce2353.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
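      A minimal userspace sketch of the straw2 selection described above
      (all names are illustrative; double-precision log() stands in for the
      kernel's 64-bit fixed-point crush_ln(), and mix_hash() stands in for
      crush_hash32_3()):

        #include <math.h>
        #include <stdint.h>
        #include <stdio.h>

        /* illustrative stand-in for crush_hash32_3(): any decent integer mix */
        static uint32_t mix_hash(uint32_t x, uint32_t item, uint32_t r)
        {
            uint32_t h = x * 2654435761u ^ item * 2246822519u ^ r * 3266489917u;
            h ^= h >> 16;
            h *= 2654435761u;
            return h ^ (h >> 13);
        }

        /* pick one item: each item draws ln(u)/w for its own pseudo-random
         * u in (0, 1]; the largest draw wins, so changing one item's weight
         * only moves inputs to or from that item, never between the others */
        static int straw2_choose(const uint32_t *weights, int size,
                                 uint32_t x, uint32_t r)
        {
            int best = -1;
            double best_draw = -INFINITY;

            for (int i = 0; i < size; i++) {
                if (!weights[i])
                    continue;                       /* zero weight never wins */
                double u = ((mix_hash(x, i, r) & 0xffff) + 1) / 65536.0;
                double draw = log(u) / weights[i];  /* ln(u) <= 0: a larger weight
                                                       pulls the draw toward 0 */
                if (draw > best_draw) {
                    best_draw = draw;
                    best = i;
                }
            }
            return best;
        }

        int main(void)
        {
            uint32_t weights[] = { 0x10000, 0x10000, 0x20000 };  /* 16.16 fixed point */
            printf("x=42 -> item %d\n", straw2_choose(weights, 3, 42, 0));
            return 0;
        }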
3. 05 Apr 2014, 2 commits
    • crush: add SET_CHOOSELEAF_VARY_R step · d83ed858
      Committed by Ilya Dryomov
      This lets you adjust the vary_r tunable on a per-rule basis.
      
      Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
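      A tiny sketch of what such a per-rule override amounts to (names here
      are illustrative, not the actual step handling in the kernel's
      crush_do_rule()): a SET_CHOOSELEAF_VARY_R step simply replaces the
      map-wide default for the remainder of the rule.

        #include <stdio.h>

        /* illustrative stand-ins for the real rule-step opcodes */
        enum step_op { STEP_SET_CHOOSELEAF_VARY_R, STEP_CHOOSELEAF };

        struct rule_step { enum step_op op; int arg1; };

        /* apply any per-rule vary_r override on top of the map-wide default */
        static int effective_vary_r(const struct rule_step *steps, int n,
                                    int map_default)
        {
            int vary_r = map_default;

            for (int i = 0; i < n; i++)
                if (steps[i].op == STEP_SET_CHOOSELEAF_VARY_R && steps[i].arg1 >= 0)
                    vary_r = steps[i].arg1;   /* rule-level override wins */
            return vary_r;
        }

        int main(void)
        {
            struct rule_step rule[] = {
                { STEP_SET_CHOOSELEAF_VARY_R, 1 },
                { STEP_CHOOSELEAF, 0 },
            };
            printf("vary_r = %d\n", effective_vary_r(rule, 2, 0));
            return 0;
        }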
    • crush: add chooseleaf_vary_r tunable · e2b149cc
      Committed by Ilya Dryomov
      The current crush_choose_firstn code will re-use the same 'r' value for
      the recursive call.  That means that if we are hitting a collision or
      rejection for some reason (say, an OSD that is marked out) and need to
      retry, we will keep making the same (bad) choice in that recursive
      selection.
      
      Introduce a tunable that fixes that behavior by incorporating the parent
      'r' value into the recursive starting point, so that a different path
      will be taken in subsequent placement attempts.
      
      Note that this was done from the get-go for the new crush_choose_indep
      algorithm.
      
      This was exposed by a user who was seeing PGs stuck in active+remapped
      after reweight-by-utilization because the up set mapped to a single OSD.
      
      Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
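      A conceptual, self-contained sketch of that behaviour (all names are
      illustrative, not the kernel's crush_choose_firstn()): with vary_r
      enabled, the inner leaf descent inherits the parent's r, so each outer
      retry explores a different path instead of repeating the same choice.

        #include <stdio.h>

        #define HOSTS           4
        #define LEAVES_PER_HOST 4
        #define MAX_ATTEMPTS    8

        /* toy integer mix standing in for crush_hash32_3() */
        static unsigned mix(unsigned a, unsigned b, unsigned c)
        {
            unsigned h = a * 2654435761u ^ b * 2246822519u ^ c * 3266489917u;
            h ^= h >> 15;
            return h * 668265263u;
        }

        /* pretend OSD 6 (host 1, leaf 2) is marked out */
        static int leaf_is_usable(int osd) { return osd != 6; }

        static int choose_host(unsigned x, int r) { return mix(x, 0, r) % HOSTS; }

        static int choose_leaf(int host, unsigned x, int r)
        {
            return host * LEAVES_PER_HOST + mix(x, host + 1, r) % LEAVES_PER_HOST;
        }

        static int choose_firstn(unsigned x, int rep, int vary_r, int parent_r)
        {
            int r = rep + parent_r;            /* a retrying parent shifts our r */

            for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++, r++) {
                int host = choose_host(x, r);
                /* without vary_r the inner descent always starts from r = 0 and
                 * can keep re-selecting the same failed leaf under a retried
                 * host; with vary_r it inherits r and varies on each retry */
                int osd = choose_leaf(host, x, vary_r ? r : 0);
                if (leaf_is_usable(osd))
                    return osd;
            }
            return -1;
        }

        int main(void)
        {
            printf("vary_r=0 -> osd %d\n", choose_firstn(42, 0, 0, 0));
            printf("vary_r=1 -> osd %d\n", choose_firstn(42, 0, 1, 0));
            return 0;
        }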
4. 01 Jan 2014, 10 commits
5. 18 Jan 2013, 1 commit
    • libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed · 1604f488
      Committed by Jim Schutt
      Add libceph support for a new CRUSH tunable recently added to Ceph servers.
      
      Consider the CRUSH rule
        step chooseleaf firstn 0 type <node_type>
      
      This rule means that <n> replicas will be chosen in a manner such that
      each chosen leaf's branch will contain a unique instance of <node_type>.
      
      When an object is re-replicated after a leaf failure, if the CRUSH map uses
      a chooseleaf rule the remapped replica ends up under the <node_type> bucket
      that held the failed leaf.  This causes uneven data distribution across the
      storage cluster, to the point that when all the leaves but one fail under a
      particular <node_type> bucket, that remaining leaf holds all the data from
      its failed peers.
      
      This behavior also limits the number of peers that can participate in the
      re-replication of the data held by the failed leaf, which increases the
      time required to re-replicate after a failure.
      
      For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
      inner and outer descents.
      
      If the tree descent down to <node_type> is the outer descent, and the descent
      from <node_type> down to a leaf is the inner descent, the issue is that a
      down leaf is detected on the inner descent, so only the inner descent is
      retried.
      
      In order to disperse re-replicated data as widely as possible across a
      storage cluster after a failure, we want to retry the outer descent. So,
      fix up crush_choose() to allow the inner descent to return immediately on
      choosing a failed leaf.  Wire this up as a new CRUSH tunable.
      
      Note that after this change, for a chooseleaf rule, if the primary OSD
      in a placement group has failed, choosing a replacement may result in
      one of the other OSDs in the PG colliding with the new primary.  This
      means that OSD's data for the PG needs to move as well.  This
      seems unavoidable but should be relatively rare.
      
      This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.
      Signed-off-by: Jim Schutt <jaschut@sandia.gov>
      Reviewed-by: Sage Weil <sage@inktank.com>
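      A conceptual sketch of the two descents described above (illustrative
      names, not the kernel's crush_choose()): with the tunable on, the inner
      descent bails out as soon as it hits a failed leaf, letting the outer
      descent retry from the root and land in a different <node_type> bucket.

        #include <stdio.h>

        #define MAX_OUTER_TRIES 8
        #define MAX_INNER_TRIES 8
        #define BUCKETS         4
        #define LEAVES_PER_BKT  4

        /* toy integer mix standing in for crush_hash32_3() */
        static unsigned mix(unsigned a, unsigned b, unsigned c)
        {
            unsigned h = a * 2654435761u ^ b * 2246822519u ^ c * 3266489917u;
            h ^= h >> 15;
            return h * 668265263u;
        }

        /* pretend every leaf under bucket 2 (leaves 8..11) except 9 is down */
        static int leaf_is_up(int leaf) { return leaf < 8 || leaf == 9 || leaf >= 12; }

        static int choose_bucket_from_root(unsigned x, int r) { return mix(x, 0, r) % BUCKETS; }

        static int choose_leaf_under(int b, unsigned x, int r)
        {
            return b * LEAVES_PER_BKT + mix(x, b + 1, r) % LEAVES_PER_BKT;
        }

        static int chooseleaf(unsigned x, int retry_descent)
        {
            for (int outer_r = 0; outer_r < MAX_OUTER_TRIES; outer_r++) {
                int bucket = choose_bucket_from_root(x, outer_r);     /* outer descent */

                for (int inner_r = 0; inner_r < MAX_INNER_TRIES; inner_r++) {
                    int leaf = choose_leaf_under(bucket, x, inner_r); /* inner descent */
                    if (leaf_is_up(leaf))
                        return leaf;
                    if (retry_descent)
                        break;  /* new: give up here so the outer descent
                                   retries from the root */
                    /* legacy: keep retrying under this same bucket, so its
                       surviving leaves absorb all of the failed leaf's data */
                }
                if (!retry_descent)
                    break;      /* legacy: the outer descent was never retried */
            }
            return -1;
        }

        int main(void)
        {
            printf("legacy   -> leaf %d\n", chooseleaf(7, 0));
            printf("retrying -> leaf %d\n", chooseleaf(7, 1));
            return 0;
        }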
6. 03 Oct 2012, 1 commit
7. 31 Jul 2012, 1 commit
8. 08 May 2012, 4 commits
9. 21 Oct 2010, 1 commit
    • ceph: factor out libceph from Ceph file system · 3d14c5d2
      Committed by Yehuda Sadeh
      This factors out protocol and low-level storage parts of ceph into a
      separate libceph module living in net/ceph and include/linux/ceph.  This
      is mostly a matter of moving files around.  However, a few key pieces
      of the interface change as well:
      
       - ceph_client becomes ceph_fs_client and ceph_client, where the latter
         captures the mon and osd clients, and the fs_client gets the mds client
         and file system specific pieces.
       - Mount option parsing and debugfs setup are correspondingly broken into
         two pieces.
       - The mon client gets a generic handler callback for otherwise unknown
         messages (mds map, in this case).
       - The basic supported/required feature bits can be expanded (and are by
         ceph_fs_client).
      
      No functional change, aside from some subtle error handling cases that got
      cleaned up in the refactoring process.
      Signed-off-by: Sage Weil <sage@newdream.net>
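      A rough sketch of the resulting split (fields heavily abridged and the
      member types stubbed out so the sketch stands alone; the real layouts
      live in include/linux/ceph/libceph.h and fs/ceph/super.h):

        /* stand-ins so this sketch compiles on its own */
        struct ceph_mon_client { int stub; };
        struct ceph_osd_client { int stub; };
        struct ceph_mds_client { int stub; };

        /* generic client, now in net/ceph (libceph): mon + osd machinery */
        struct ceph_client {
            struct ceph_mon_client monc;
            struct ceph_osd_client osdc;
            /* ... messenger, options, supported/required feature bits ... */
        };

        /* filesystem client, in fs/ceph: wraps the generic client and adds
         * the MDS client plus fs-only state (mount options, debugfs, ...) */
        struct ceph_fs_client {
            struct ceph_client *client;
            struct ceph_mds_client *mdsc;
            /* ... */
        };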