1. 31 May 2016, 1 commit
  2. 26 May 2016, 14 commits
    • libceph: replace ceph_monc_request_next_osdmap() · 7cca78c9
      Ilya Dryomov committed
      ... with a wrapper around maybe_request_map() - no need for two
      osdmap-specific functions.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: pool deletion detection · 4609245e
      Ilya Dryomov committed
      This adds the "map check" infrastructure for sending osdmap version
      checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
      -ENOENT if the target pool doesn't exist or has just been deleted.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
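      A hedged sketch of the idea: helper names such as send_map_check() and
      complete_request() are assumptions paraphrased from this description,
      not the verbatim kernel code.

          /* calc_target() says the target pool does not exist; before
           * failing, verify with the monitors that our osdmap isn't stale */
          if (ct_res == CALC_TARGET_POOL_DNE) {
                  send_map_check(req);    /* assumed helper name */
                  return;
          }

          /* map check completion: the map was current after all,
           * so the pool is really gone */
          complete_request(req, -ENOENT); /* assumed helper name */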
    • libceph: support for checking on status of watch · b07d3c4b
      Ilya Dryomov committed
      Implement ceph_osdc_watch_check() to be able to check on the status of
      a watch.  Note that the time it takes for a watch/notify event to get
      delivered through the notify_wq is taken into account.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
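      A minimal usage sketch, assuming the checker takes the osd client and
      the linger handle returned by ceph_osdc_watch(); the return-value
      semantics and the recovery helper are assumptions, not the exact API.

          /* assumed: negative errno means the watch is no longer valid */
          ret = ceph_osdc_watch_check(osdc, handle);
          if (ret < 0)
                  reregister_watch();     /* hypothetical recovery path */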
    • libceph: support for sending notifies · 19079203
      Ilya Dryomov committed
      Implement ceph_osdc_notify() for sending notifies.
      
      Because the current messenger can't read into pagelists (it can only
      write out from them), I had to go with a page vector for the
      NOTIFY_COMPLETE payload, for now.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
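      A hedged sketch of a notify call; the signature is paraphrased from
      this series and the argument order and timeout units are assumptions.

          struct page **reply_pages;
          size_t reply_len;
          int ret;

          ret = ceph_osdc_notify(osdc, &oid, &oloc, payload, payload_len,
                                 5 /* timeout, assumed seconds */,
                                 &reply_pages, &reply_len);
          if (ret)
                  return ret;
          /* reply_pages carries the NOTIFY_COMPLETE payload as the page
           * vector mentioned above */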
    • libceph, rbd: ceph_osd_linger_request, watch/notify v2 · 922dab61
      Ilya Dryomov committed
      This adds support for, and switches rbd to, a new, more reliable
      version of the watch/notify protocol.  As with the OSD client update,
      this is mostly about getting the right structures linked into the
      right places so that reconnects are properly sent when needed.
      watch/notify v2 also requires sending regular pings to the OSDs -
      send_linger_ping().
      
      A major change from the old watch/notify implementation is the
      introduction of ceph_osd_linger_request - linger requests no longer
      piggy back on ceph_osd_request.  ceph_osd_event has been merged into
      ceph_osd_linger_request.
      
      All the details are now hidden within libceph, the interface consists
      of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
      ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
      the lifetime management simple.
      
      ceph_osdc_notify_ack() accepts an optional data payload, which is
      relayed back to the notifier.
      
      Portions of this patch are loosely based on work by Douglas Fuller
      <dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
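      A hedged sketch of the watch/unwatch pair and ceph_osdc_notify_ack()
      described above; callback signatures and exact parameters are
      assumptions paraphrased from the commit message.

          struct ceph_osd_linger_request *handle;

          handle = ceph_osdc_watch(osdc, &oid, &oloc, watchcb, errcb, priv);
          if (IS_ERR(handle))
                  return PTR_ERR(handle);

          /* from watchcb: acknowledge a notify, optionally relaying a
           * payload back to the notifier */
          ceph_osdc_notify_ack(osdc, &oid, &oloc, notify_id, cookie,
                               payload, payload_len);

          ceph_osdc_unwatch(osdc, handle);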
    • libceph: a major OSD client update · 5aea3dcd
      Ilya Dryomov committed
      This is a major sync up, up to ~Jewel.  The highlights are:
      
      - per-session request trees (vs a global per-client tree)
      - per-session locking (vs a global per-client rwlock)
      - homeless OSD session
      - no ad-hoc global per-client lists
      - support for pool quotas
      - foundation for watch/notify v2 support
      - foundation for map check (pool deletion detection) support
      
      The switchover is incomplete: lingering requests can be set up and
      torn down but are never reestablished.  This functionality is restored
      with the introduction of the new lingering infrastructure
      (ceph_osd_linger_request, linger_work, etc.) in a later commit.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
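      A rough, abridged sketch of the per-session layout the highlights
      describe; field names are assumptions based on the description.

          struct ceph_osd {                   /* one per OSD session */
                  struct mutex lock;          /* per-session lock */
                  struct rb_root o_requests;  /* per-session request tree */
                  ...
          };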
    • libceph: protect osdc->osd_lru list with a spinlock · 9dd2845c
      Ilya Dryomov committed
      The OSD client is being moved from the big per-client lock to a set of
      per-session locks.  The big rwlock would only be held for read most of
      the time, so the global osdc->osd_lru needs additional protection.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
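      A minimal sketch of the new protection; the lock name osd_lru_lock is
      an assumption based on the description.

          spin_lock(&osdc->osd_lru_lock);
          list_add_tail(&osd->o_osd_lru, &osdc->osd_lru);
          spin_unlock(&osdc->osd_lru_lock);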
    • libceph: redo callbacks and factor out MOSDOpReply decoding · fe5da05e
      Ilya Dryomov committed
      If you specify ACK | ONDISK and set ->r_unsafe_callback, both
      ->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
      very confusing.  Redo this so that only one of them is called:
      
          ->r_unsafe_callback(true), on ack
          ->r_unsafe_callback(false), on commit
      
      or
      
          ->r_callback, on ack|commit
      
      Decode everything in decode_MOSDOpReply() to reduce clutter.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
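      A hedged sketch of the disambiguated completion logic; the control
      flow is paraphrased from the schema above, not the exact patch.

          if (req->r_unsafe_callback) {
                  /* one invocation per event: true on ack, false on commit */
                  req->r_unsafe_callback(req, on_ack);
          } else if (req->r_callback) {
                  req->r_callback(req);   /* once, on ack|commit */
          }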
    • libceph: drop msg argument from ceph_osdc_callback_t · 85e084fe
      Ilya Dryomov committed
      finish_read(), its only user, uses it to get to hdr.data_len, which is
      what ->r_result is set to on success.  This gains us the ability to
      safely call callbacks from contexts other than reply, e.g. map check.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
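      The resulting callback shape, sketched; the "before" form is
      paraphrased from the commit title and may not match the tree exactly.

          /* before: void (*)(struct ceph_osd_request *, struct ceph_msg *) */
          typedef void (*ceph_osdc_callback_t)(struct ceph_osd_request *);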
    • libceph: switch to calc_target(), part 2 · bb873b53
      Ilya Dryomov committed
      The crux of this is getting rid of ceph_osdc_build_request(), so that
      MOSDOp can be encoded not before but after calc_target() calculates the
      actual target.  Encoding now happens within ceph_osdc_start_request().
      
      Also nuked is the accompanying bunch of pointers into the encoded
      buffer that was used to update fields on each send - instead, the
      entire front is re-encoded.  If we want to support target->name_len !=
      base->name_len in the future, there is no other way, because oid is
      surrounded by other fields in the encoded buffer.
      
      Encoding OSD ops and adding data items to the request message were
      mixed together in osd_req_encode_op().  While we want to re-encode OSD
      ops, we don't want to add duplicate data items to the message when
      resending, so all calls to ceph_osdc_msg_data_add() are factored out
      into a new setup_request_data().
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
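      The reordered start path, sketched; encode_request() and send_request()
      follow the description above, and the exact signatures are assumptions.

          /* inside ceph_osdc_start_request(): target first, then encode */
          calc_target(osdc, &req->r_t);
          encode_request(req, req->r_request); /* re-encodes the whole front */
          send_request(req);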
    • libceph: switch to calc_target(), part 1 · a66dd383
      Ilya Dryomov committed
      Replace __calc_request_pg() and most of __map_request() with
      calc_target() and start using req->r_t.
      
      ceph_osdc_build_request(), however, still encodes base_oid, because it
      is called before calc_target() and target_oid is still empty at that
      point; a printf in osdc_show() also shows base_oid.  This is fixed in
      "libceph: switch to calc_target(), part 2".
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: introduce ceph_osd_request_target, calc_target() · 63244fa1
      Ilya Dryomov committed
      Introduce ceph_osd_request_target, containing all mapping-related
      fields of ceph_osd_request, and calc_target() for calculating mappings
      and populating it.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
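      A rough, abridged shape of the new container; the field list is an
      assumption based on the description, not the verbatim declaration.

          struct ceph_osd_request_target {
                  struct ceph_object_id base_oid;   /* what the caller set */
                  struct ceph_object_locator base_oloc;
                  struct ceph_object_id target_oid; /* what calc_target() resolved */
                  struct ceph_object_locator target_oloc;
                  struct ceph_pg pgid;
                  int osd;
                  ...
          };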
    • libceph: nuke unused fields and functions · 0c0a8de1
      Ilya Dryomov committed
      Either unused or useless:
      
          osdmap->mkfs_epoch
          osd->o_marked_for_keepalive
          monc->num_generic_requests
          osdc->map_waiters
          osdc->last_requested_map
          osdc->timeout_tid
      
          osd_req_op_cls_response_data()
      
          osdmap_apply_incremental() @msgr arg
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: move message allocation out of ceph_osdc_alloc_request() · 13d1ad16
      Ilya Dryomov committed
      The size of ->r_request and ->r_reply messages depends on the size of
      the object name (ceph_object_id), while the size of ceph_osd_request is
      fixed.  Move message allocation into a separate function that would
      have to be called after ceph_object_id and ceph_object_locator (which
      is also going to become variable in size with RADOS namespaces) have
      been filled in:
      
          req = ceph_osdc_alloc_request(...);
          <fill in req->r_base_oid>
          <fill in req->r_base_oloc>
          ceph_osdc_alloc_messages(req);
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
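      The same flow fleshed out as a hedged sketch with error handling; the
      gfp flags, num_ops value, and the ceph_oid_copy()/ceph_oloc_copy()
      helpers are illustrative assumptions.

          req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_NOIO);
          if (!req)
                  return -ENOMEM;

          ceph_oid_copy(&req->r_base_oid, &oid);
          ceph_oloc_copy(&req->r_base_oloc, &oloc);

          ret = ceph_osdc_alloc_messages(req, GFP_NOIO);
          if (ret)
                  goto out_put_req;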
  3. 26 April 2016, 1 commit
    • libceph: make authorizer destruction independent of ceph_auth_client · 6c1ea260
      Ilya Dryomov committed
      Starting the kernel client with cephx disabled and then enabling cephx
      and restarting userspace daemons can result in a crash:
      
          [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
          [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262671.584334] PGD 0
          [262671.635847] Oops: 0000 [#1] SMP
          [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
          [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
          [262672.268937] Workqueue: ceph-msgr con_work [libceph]
          [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
          [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
          [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
          [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
          [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
          [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
          [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
          [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
          [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
          [262673.417556] Stack:
          [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
          [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
          [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
          [262673.805230] Call Trace:
          [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
          [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
          [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
          [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
          [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
          [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
          [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
          [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
          [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
          [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
          [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
          [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
          [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
          [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
          [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
          [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
          [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
          [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
      
      What happens is the following:
      
          (1) new MON session is established
          (2) old "none" ac is destroyed
          (3) new "cephx" ac is constructed
          ...
          (4) old OSD session (w/ "none" authorizer) is put
                ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)
      
      osd->o_auth.authorizer in the "none" case is just a bare pointer into
      ac, which contains a single static copy for all services.  By the time
      we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
      a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
      so we end up trying to destroy a "none" authorizer with a "cephx"
      destructor operating on invalid memory!
      
      To fix this, decouple authorizer destruction from ac and do away with
      a single static "none" authorizer by making a copy for each OSD or MDS
      session.  Authorizers themselves are independent of ac and so there is
      no reason for destroy_authorizer() to be an ac op.  Make it an op on
      the authorizer itself by turning ceph_authorizer into a real struct.
      
      Fixes: http://tracker.ceph.com/issues/15447
      Reported-by: Alan Zhang <alan.zhang@linux.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Sage Weil <sage@redhat.com>
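      The shape of the fix, sketched; this paraphrases the description
      above, and the exact layout in the tree may differ.

          /* ceph_authorizer becomes a real struct carrying its own
           * destructor, so destruction no longer reaches into a
           * possibly-replaced ac */
          struct ceph_authorizer {
                  void (*destroy)(struct ceph_authorizer *);
          };

          void ceph_auth_destroy_authorizer(struct ceph_authorizer *a)
          {
                  a->destroy(a);
          }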
  4. 26 March 2016, 4 commits
  5. 25 June 2015, 1 commit
  6. 09 January 2015, 1 commit
  7. 18 December 2014, 2 commits
  8. 08 July 2014, 4 commits
  9. 03 April 2014, 3 commits
  10. 28 January 2014, 5 commits
  11. 14 December 2013, 1 commit
    • libceph: block I/O when PAUSE or FULL osd map flags are set · d29adb34
      Josh Durgin committed
      The PAUSEWR and PAUSERD flags are meant to stop the cluster from
      processing writes and reads, respectively. The FULL flag is set when
      the cluster determines that it is out of space, and will no longer
      process writes.  PAUSEWR and PAUSERD are purely client-side settings
      already implemented in userspace clients. The osd does nothing special
      with these flags.
      
      When the FULL flag is set, however, the osd responds to all writes
      with -ENOSPC. For cephfs, this makes sense, but for rbd the block
      layer translates this into EIO.  If a cluster goes from full to
      non-full quickly, a filesystem on top of rbd will not behave well,
      since some writes succeed while others get EIO.
      
      Fix this by blocking any writes when the FULL flag is set in the osd
      client. This is the same strategy used by userspace, so apply it by
      default.  A follow-on patch makes this configurable.
      
      __map_request() is called to re-target osd requests when the set of
      available osds changes.  Add a paused field to ceph_osd_request, and
      set it whenever an appropriate osd map flag is set.  Avoid queueing
      paused requests in __map_request(), but force them to be resent if
      they become unpaused.
      
      Also subscribe to the next osd map from the monitor if any of these
      flags are set, so paused requests can be unblocked as soon as
      possible.
      
      Fixes: http://tracker.ceph.com/issues/6079
      Reviewed-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
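      The pause decision, sketched in the shape the message describes; the
      function name and flag macros follow common libceph conventions and
      should be read as a paraphrase, not the exact patch.

          static bool __req_should_be_paused(struct ceph_osd_client *osdc,
                                             struct ceph_osd_request *req)
          {
                  bool pauserd = ceph_osdmap_flag(osdc->osdmap,
                                                  CEPH_OSDMAP_PAUSERD);
                  bool pausewr = ceph_osdmap_flag(osdc->osdmap,
                                                  CEPH_OSDMAP_PAUSEWR) ||
                                 ceph_osdmap_flag(osdc->osdmap,
                                                  CEPH_OSDMAP_FULL);

                  /* reads pause on PAUSERD; writes pause on PAUSEWR or FULL */
                  return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
                         (req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
          }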
  12. 10 September 2013, 1 commit
  13. 04 July 2013, 1 commit
  14. 03 May 2013, 1 commit