1. 27 Feb 2013, 2 commits
  2. 18 Jan 2013, 2 commits
    • libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed · 1604f488
      Committed by Jim Schutt
      Add libceph support for a new CRUSH tunable recently added to Ceph servers.
      
      Consider the CRUSH rule
        step chooseleaf firstn 0 type <node_type>
      
      This rule means that <n> replicas will be chosen in a manner such that
      each chosen leaf's branch will contain a unique instance of <node_type>.
      
      When an object is re-replicated after a leaf failure, if the CRUSH map uses
      a chooseleaf rule, the remapped replica ends up under the <node_type> bucket
      that held the failed leaf.  This causes uneven data distribution across the
      storage cluster, to the point that when all the leaves but one fail under a
      particular <node_type> bucket, that remaining leaf holds all the data from
      its failed peers.
      
      This behavior also limits the number of peers that can participate in the
      re-replication of the data held by the failed leaf, which increases the
      time required to re-replicate after a failure.
      
      For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
      inner and outer descents.
      
      If the tree descent down to <node_type> is the outer descent, and the descent
      from <node_type> down to a leaf is the inner descent, the issue is that a
      down leaf is detected on the inner descent, so only the inner descent is
      retried.
      
      In order to disperse re-replicated data as widely as possible across a
      storage cluster after a failure, we want to retry the outer descent. So,
      fix up crush_choose() to allow the inner descent to return immediately on
      choosing a failed leaf.  Wire this up as a new CRUSH tunable.
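      
      A toy model of this behaviour change (this is not the actual
      net/ceph/crush/mapper.c code; the tunable name chooseleaf_descend_once,
      the retry budgets, and every helper here are illustrative assumptions):
      
        #include <stdbool.h>
        
        #define NUM_BUCKETS        4    /* <node_type> buckets under the root */
        #define LEAVES_PER_BUCKET  4    /* leaves (devices) under each bucket */
        #define OUTER_TRIES       19    /* arbitrary retry budgets            */
        #define INNER_TRIES       19
        
        struct toy_map {
            bool chooseleaf_descend_once;   /* the new tunable (name assumed) */
            bool leaf_down[NUM_BUCKETS][LEAVES_PER_BUCKET];
        };
        
        /* Pick one leaf for a replica, modelling the inner/outer descents. */
        static int toy_chooseleaf(const struct toy_map *map, unsigned int seed)
        {
            for (int outer = 0; outer < OUTER_TRIES; outer++) {
                /* outer descent: choose a <node_type> bucket */
                int bucket = (seed + outer) % NUM_BUCKETS;
        
                for (int inner = 0; inner < INNER_TRIES; inner++) {
                    /* inner descent: choose a leaf under that bucket */
                    int leaf = (seed + inner) % LEAVES_PER_BUCKET;
        
                    if (!map->leaf_down[bucket][leaf])
                        return bucket * LEAVES_PER_BUCKET + leaf;
        
                    if (map->chooseleaf_descend_once)
                        break;  /* new: give up here, retry from the root   */
                    /* legacy: retry only the inner descent, so the replica
                     * stays under the same bucket as the failed leaf       */
                }
            }
            return -1;      /* no usable leaf found */
        }
        
        int main(void)
        {
            struct toy_map map = { .chooseleaf_descend_once = true };
        
            map.leaf_down[1][1] = true;             /* fail one leaf           */
            return toy_chooseleaf(&map, 5) < 0;     /* exits 0 if a leaf found */
        }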
      
      Note that after this change, for a chooseleaf rule, if the primary OSD
      in a placement group has failed, choosing a replacement may result in
      one of the other OSDs in the PG colliding with the new primary.  This
      means that OSD's data for that PG must be moved as well.  This
      seems unavoidable but should be relatively rare.
      
      This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.
      Signed-off-by: Jim Schutt <jaschut@sandia.gov>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • ceph: Check for created flag in response from mds · 6e8575fa
      Committed by Sam Lang
      The mds now sends back a created inode if the create request
      performed the create.  If the file already existed, no inode is
      returned in the reply.  This allows ceph to set the created flag
      in atomic_open so that permissions are properly checked in the case
      that the file wasn't created by the create call to the mds.
      
      To ensure compatibility with previous kernels, a feature for sending
      back the inode in the create reply was added, so that the mds will
      only send back the inode if the client indicates it supports the
      feature.
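      
      A minimal sketch of the client-side decision described above (the field
      name has_created_ino and the FILE_CREATED_FLAG constant are illustrative
      assumptions, not the actual fs/ceph identifiers):
      
        #include <stdbool.h>
        #include <stdio.h>
        
        #define FILE_CREATED_FLAG 0x1   /* stands in for the VFS "created" bit */
        
        /* Simplified model of the MDS create reply. */
        struct mds_create_reply {
            bool has_created_ino;   /* MDS included the newly created inode */
        };
        
        /* Set the created flag only when the MDS reports it actually created
         * the file; if the file already existed, no inode comes back and the
         * flag stays clear, so permissions are checked normally. */
        static void handle_atomic_open_reply(bool requested_create,
                                             const struct mds_create_reply *reply,
                                             int *opened)
        {
            if (requested_create && reply->has_created_ino)
                *opened |= FILE_CREATED_FLAG;
        }
        
        int main(void)
        {
            struct mds_create_reply existed = { .has_created_ino = false };
            struct mds_create_reply created = { .has_created_ino = true };
            int opened = 0;
        
            handle_atomic_open_reply(true, &existed, &opened);
            printf("pre-existing file, created flag: %d\n", opened);
        
            handle_atomic_open_reply(true, &created, &opened);
            printf("newly created file, created flag: %d\n", opened);
            return 0;
        }
      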
      Signed-off-by: Sam Lang <sam.lang@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
  3. 31 Jul 2012, 2 commits