1. 15 Dec, 2016 2 commits
    • libceph: remove now unused finish_request() wrapper · 45ee2c1d
      Ilya Dryomov committed
      Kill the wrapper and rename __finish_request() to finish_request().
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: always signal completion when done · c297eb42
      Ilya Dryomov committed
      r_safe_completion is currently, and has always been, signaled only if
      on-disk ack was requested.  It's there for fsync and syncfs, which wait
      for in-flight writes to flush - all data write requests set ONDISK.
      
      However, the pool perm check code introduced in 4.2 sends a write
      request with only ACK set.  An unfortunately timed syncfs can then hang
      forever: r_safe_completion won't be signaled because only an unsafe
      reply was requested.
      
      We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
      that is somewhat incomplete and yet another special case.  Instead,
      rename this completion to r_done_completion and always signal it when
      the OSD client is done with the request, whether unsafe, safe, or
      error.  This is a bit cleaner and helps with the cancellation code.
      Reported-by: Yan, Zheng <zyan@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
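The difference between the old and new signalling rules described in this commit can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: `struct request`, `FLAG_ACK`, `FLAG_ONDISK`, `finish_old()` and `finish_new()` are hypothetical stand-ins, and plain booleans model the kernel completions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ACK and ONDISK request flags. */
#define FLAG_ACK    0x1u
#define FLAG_ONDISK 0x2u

struct request {
    unsigned int flags;
    bool safe_signaled; /* models the old r_safe_completion */
    bool done_signaled; /* models the new r_done_completion */
};

/* Old rule: signal only if an on-disk ack was requested. */
static void finish_old(struct request *req)
{
    assert(req != NULL);
    if (req->flags & FLAG_ONDISK)
        req->safe_signaled = true;
}

/* New rule: signal whenever the client is done with the request. */
static void finish_new(struct request *req)
{
    assert(req != NULL);
    req->done_signaled = true;
}
```

For a write request that sets only `FLAG_ACK`, `finish_old()` leaves `safe_signaled` false, which is the situation in which a sync waiter would block forever; `finish_new()` sets `done_signaled` regardless of the flags.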
  2. 13 Dec, 2016 1 commit
  3. 11 Nov, 2016 1 commit
  4. 25 Aug, 2016 3 commits
  5. 09 Aug, 2016 1 commit
  6. 28 Jul, 2016 3 commits
  7. 31 May, 2016 3 commits
    • libceph: use %s instead of %pE in dout()s · 4a3262b1
      Ilya Dryomov committed
      Commit d30291b9 ("libceph: variable-sized ceph_object_id") changed
      dout()s in what is now encode_request() and ceph_object_locator_to_pg()
      to use %pE, mostly to document that, although all rbd and cephfs object
      names are NUL-terminated strings, ceph_object_id will handle any RADOS
      object name, including ones containing NULs, just fine.
      
      However, it turns out that vbin_printf() can't handle anything but ints
      and %s - all %p suffixes are ignored.  The buffer %p** points to isn't
      recorded, resulting in trash in the messages if the buffer had been
      reused by the time bstr_printf() got to it.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: put request only if it's done in handle_reply() · dc045a91
      Ilya Dryomov committed
      handle_reply() may be called twice on the same request: on ack and then
      on commit.  This occurs on btrfs-formatted OSDs or if cephfs sync write
      path is triggered - CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK.
      
      handle_reply() handles this with the help of done_request().
      
      Fixes: 5aea3dcd ("libceph: a major OSD client update")
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: change ceph_osdmap_flag() to take osdc · b7ec35b3
      Ilya Dryomov committed
      For the benefit of every single caller, take osdc instead of map.
      Also, now that osdc->osdmap can't ever be NULL, drop the check.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  8. 26 May, 2016 26 commits
    • libceph: make ceph_osdc_wait_request() uninterruptible · 0e76abf2
      Yan, Zheng committed
      ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
      cases, the sync IO should be uninterruptible. The fix is to use a
      killable wait function in ceph_osdc_wait_request().
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • libceph: replace ceph_monc_request_next_osdmap() · 7cca78c9
      Ilya Dryomov committed
      ... with a wrapper around maybe_request_map() - no need for two
      osdmap-specific functions.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: pool deletion detection · 4609245e
      Ilya Dryomov committed
      This adds the "map check" infrastructure for sending osdmap version
      checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
      -ENOENT if the target pool doesn't exist or has just been deleted.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: support for checking on status of watch · b07d3c4b
      Ilya Dryomov committed
      Implement ceph_osdc_watch_check() to be able to check on status of
      watch.  Note that the time it takes for a watch/notify event to get
      delivered through the notify_wq is taken into account.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: support for sending notifies · 19079203
      Ilya Dryomov committed
      Implement ceph_osdc_notify() for sending notifies.
      
      Due to the fact that the current messenger can't do read-in into
      pagelists (it can only do write-out from them), I had to go with a page
      vector for a NOTIFY_COMPLETE payload, for now.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph, rbd: ceph_osd_linger_request, watch/notify v2 · 922dab61
      Ilya Dryomov committed
      This adds support and switches rbd to a new, more reliable version of
      watch/notify protocol.  As with the OSD client update, this is mostly
      about getting the right structures linked into the right places so that
      reconnects are properly sent when needed.  watch/notify v2 also
      requires sending regular pings to the OSDs - send_linger_ping().
      
      A major change from the old watch/notify implementation is the
      introduction of ceph_osd_linger_request - linger requests no longer
      piggyback on ceph_osd_request.  ceph_osd_event has been merged into
      ceph_osd_linger_request.
      
      All the details are now hidden within libceph; the interface consists
      of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
      ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
      the lifetime management simple.
      
      ceph_osdc_notify_ack() accepts an optional data payload, which is
      relayed back to the notifier.
      
      Portions of this patch are loosely based on work by Douglas Fuller
      <dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: wait_request_timeout() · 42b06965
      Ilya Dryomov committed
      The unwatch timeout is currently implemented in rbd.  With
      watch/unwatch code moving into libceph, we are going to need
      a ceph_osdc_wait_request() variant with a timeout.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: request_init() and request_release_checks() · 3540bfdb
      Ilya Dryomov committed
      These are going to be used by request_reinit() code.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: a major OSD client update · 5aea3dcd
      Ilya Dryomov committed
      This is a major sync up, up to ~Jewel.  The highlights are:
      
      - per-session request trees (vs a global per-client tree)
      - per-session locking (vs a global per-client rwlock)
      - homeless OSD session
      - no ad-hoc global per-client lists
      - support for pool quotas
      - foundation for watch/notify v2 support
      - foundation for map check (pool deletion detection) support
      
      The switchover is incomplete: lingering requests can be set up and
      torn down but aren't ever reestablished.  This functionality is
      restored with the introduction of the new lingering infrastructure
      (ceph_osd_linger_request, linger_work, etc) in a later commit.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: protect osdc->osd_lru list with a spinlock · 9dd2845c
      Ilya Dryomov committed
      OSD client is getting moved from the big per-client lock to a set of
      per-session locks.  The big rwlock would only be held for read most of
      the time, so a global osdc->osd_lru needs additional protection.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: allocate ceph_osd with GFP_NOFAIL · 7a28f59b
      Ilya Dryomov committed
      create_osd() is called way too deep in the stack to be able to error
      out in a sane way; a failing create_osd() just messes everything up.
      The current req_notarget list solution is broken - the list is never
      traversed as it's not entirely clear when to do it, I guess.
      
      If we were to start traversing it at regular intervals and retrying
      each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
      so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
      with a better fix.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: osd_init() and osd_cleanup() · 0247a0cf
      Ilya Dryomov committed
      These are going to be used by homeless OSD sessions code.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: handle_one_map() · 42c1b124
      Ilya Dryomov committed
      Separate osdmap handling from decoding and iterating over a bag of maps
      in a fresh MOSDMap message.  This sets up the scene for the updated OSD
      client.
      
      Of particular importance here is the addition of pi->was_full, which
      can be used to answer "did this pool go full -> not-full in this map?".
      This is the key bit for supporting pool quotas.
      
      We won't be able to downgrade map_sem for much longer, so drop
      downgrade_write().
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
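The "did this pool go full -> not-full in this map?" question from this commit can be sketched as follows. This is a minimal user-space illustration, not the kernel code: `struct pool_info`, `apply_new_map()` and `pool_cleared_full()` are hypothetical names, and a single boolean stands in for whatever the real osdmap uses to mark a pool as full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-pool state; the real pool info carries much more. */
struct pool_info {
    bool full;     /* full in the current map */
    bool was_full; /* full in the previous map */
};

/* Record the previous state before applying the state from a new map. */
static void apply_new_map(struct pool_info *pi, bool full_in_new_map)
{
    assert(pi != NULL);
    pi->was_full = pi->full;
    pi->full = full_in_new_map;
}

/* True only for the map in which the pool went from full to not full. */
static bool pool_cleared_full(const struct pool_info *pi)
{
    assert(pi != NULL);
    return pi->was_full && !pi->full;
}
```

A caller would presumably use the transition to resume writes that were held back while the pool was over quota; that policy is outside this sketch.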
    • libceph: allocate dummy osdmap in ceph_osdc_init() · e5253a7b
      Ilya Dryomov committed
      This leads to a simpler osdmap handling code, particularly when dealing
      with pi->was_full, which is introduced in a later commit.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: schedule tick from ceph_osdc_init() · fbca9635
      Ilya Dryomov committed
      Both homeless OSD sessions and watch/notify v2, introduced in later
      commits, require periodic ticks which don't depend on ->num_requests.
      Schedule the initial tick from ceph_osdc_init() and reschedule from
      handle_timeout() unconditionally.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: move schedule_delayed_work() in ceph_osdc_init() · b37ee1b9
      Ilya Dryomov committed
      ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
      with handle_osds_timeout() running on invalid memory if any one of the
      allocations fails.  Call schedule_delayed_work() after everything is
      set up, just before returning.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: redo callbacks and factor out MOSDOpReply decoding · fe5da05e
      Ilya Dryomov committed
      If you specify ACK | ONDISK and set ->r_unsafe_callback, both
      ->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
      very confusing.  Redo this so that only one of them is called:
      
          ->r_unsafe_callback(true), on ack
          ->r_unsafe_callback(false), on commit
      
      or
      
          ->r_callback, on ack|commit
      
      Decode everything in decode_MOSDOpReply() to reduce clutter.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
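The "only one of them is called" rule from this commit can be illustrated with a small dispatcher. This is a sketch under assumptions, not the libceph implementation: `struct request`, `enum reply_kind`, `dispatch_reply()` and the counting callbacks are hypothetical, and the handling of duplicate or error replies is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum reply_kind { REPLY_ACK, REPLY_COMMIT };

/* Hypothetical request with the two callback slots and test counters. */
struct request {
    void (*r_callback)(struct request *req);
    void (*r_unsafe_callback)(struct request *req, bool unsafe);
    int plain_calls;
    int unsafe_true_calls;
    int unsafe_false_calls;
};

/* Example callbacks that only count how often they were invoked. */
static void count_plain(struct request *req)
{
    req->plain_calls++;
}

static void count_unsafe(struct request *req, bool unsafe)
{
    if (unsafe)
        req->unsafe_true_calls++;
    else
        req->unsafe_false_calls++;
}

/*
 * Invoke exactly one callback per reply: the unsafe callback if it is
 * set (true on ack, false on commit), otherwise the plain callback.
 */
static void dispatch_reply(struct request *req, enum reply_kind kind)
{
    assert(req != NULL);
    if (req->r_unsafe_callback)
        req->r_unsafe_callback(req, kind == REPLY_ACK);
    else if (req->r_callback)
        req->r_callback(req);
}
```

With both slots set, an ack followed by a commit results in `count_unsafe(req, true)` and then `count_unsafe(req, false)`, and `count_plain()` is never called.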
    • libceph: drop msg argument from ceph_osdc_callback_t · 85e084fe
      Ilya Dryomov committed
      finish_read(), its only user, uses it to get to hdr.data_len, which is
      what ->r_result is set to on success.  This gains us the ability to
      safely call callbacks from contexts other than reply, e.g. map check.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: switch to calc_target(), part 2 · bb873b53
      Ilya Dryomov committed
      The crux of this is getting rid of ceph_osdc_build_request(), so that
      MOSDOp can be encoded not before but after calc_target() calculates the
      actual target.  Encoding now happens within ceph_osdc_start_request().
      
      Also nuked is the accompanying bunch of pointers into the encoded
      buffer that was used to update fields on each send - instead, the
      entire front is re-encoded.  If we want to support target->name_len !=
      base->name_len in the future, there is no other way, because oid is
      surrounded by other fields in the encoded buffer.
      
      Encoding OSD ops and adding data items to the request message were
      mixed together in osd_req_encode_op().  While we want to re-encode OSD
      ops, we don't want to add duplicate data items to the message when
      resending, so all calls to ceph_osdc_msg_data_add() are factored out
      into a new setup_request_data().
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: switch to calc_target(), part 1 · a66dd383
      Ilya Dryomov committed
      Replace __calc_request_pg() and most of __map_request() with
      calc_target() and start using req->r_t.
      
      ceph_osdc_build_request() however still encodes base_oid, because it's
      called before calc_target() is and target_oid is empty at that point in
      time; a printf in osdc_show() also shows base_oid.  This is fixed in
      "libceph: switch to calc_target(), part 2".
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: introduce ceph_osd_request_target, calc_target() · 63244fa1
      Ilya Dryomov committed
      Introduce ceph_osd_request_target, containing all mapping-related
      fields of ceph_osd_request, and calc_target() for calculating mappings
      and populating it.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: ceph_osds, ceph_pg_to_up_acting_osds() · 6f3bfd45
      Ilya Dryomov committed
      Knowing just the acting set isn't enough; we need to be able to record
      the up set as well to detect interval changes.  This means returning (up[],
      up_len, up_primary, acting[], acting_len, acting_primary) and passing
      it around.  Introduce and switch to ceph_osds to help with that.
      
      Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
      both up and acting sets from it.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
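Why the up set has to be recorded alongside the acting set can be shown with a small comparison helper. This is a sketch under assumptions: `struct osd_set`, `MAX_OSDS`, `osd_set_equal()` and `mapping_changed()` are hypothetical names, and a real interval-change check would likely consider more inputs than these two sets.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_OSDS 8 /* hypothetical cap for this sketch */

/* One set of OSDs (up or acting) together with its primary. */
struct osd_set {
    int osds[MAX_OSDS];
    int size;
    int primary;
};

/* Two sets are equal if size, primary and the first `size` ids match. */
static bool osd_set_equal(const struct osd_set *a, const struct osd_set *b)
{
    assert(a->size >= 0 && a->size <= MAX_OSDS);
    assert(b->size >= 0 && b->size <= MAX_OSDS);
    return a->size == b->size && a->primary == b->primary &&
           memcmp(a->osds, b->osds,
                  (size_t)a->size * sizeof(a->osds[0])) == 0;
}

/* A change in either the up set or the acting set counts as a change. */
static bool mapping_changed(const struct osd_set *old_up,
                            const struct osd_set *new_up,
                            const struct osd_set *old_acting,
                            const struct osd_set *new_acting)
{
    return !osd_set_equal(old_up, new_up) ||
           !osd_set_equal(old_acting, new_acting);
}
```

If only the acting sets were compared, a map in which the up set changed while the acting set stayed the same would go unnoticed; passing both sets around as one unit avoids that.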
    • libceph: rename ceph_oloc_oid_to_pg() · d9591f5e
      Ilya Dryomov committed
      Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
      that the returned PG is raw and return -ENOENT instead of -EIO if the pool
      doesn't exist.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: DEFINE_RB_FUNCS macro · fcd00b68
      Ilya Dryomov committed
      Given
      
          struct foo {
              u64 id;
              struct rb_node bar_node;
          };
      
      generate insert_bar(), erase_bar() and lookup_bar() functions with
      
          DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
      
      The key is assumed to be an integer (u64, int, etc), compared with
      < and >.  nodefld has to be initialized with RB_CLEAR_NODE().
      
      Start using it for MDS, MON and OSD requests and OSD sessions.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
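The idea of generating insert, erase and lookup helpers from one macro can be tried out in user space with the sketch below. It is not the kernel macro: `DEFINE_TREE_FUNCS` and `struct tree_node` are hypothetical, a plain unbalanced binary search tree stands in for the kernel rbtree so the code compiles without kernel headers, the key is assumed to fit in `uint64_t`, and inserting a duplicate key is treated as a bug via `assert()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain binary-tree node; a stand-in for the kernel's struct rb_node. */
struct tree_node {
    struct tree_node *left, *right;
};

#define DEFINE_TREE_FUNCS(name, type, keyfld, nodefld)                    \
/* Map an embedded node back to its containing structure. */             \
static type *name##_entry(struct tree_node *n)                            \
{                                                                         \
    return (type *)((char *)n - offsetof(type, nodefld));                 \
}                                                                         \
                                                                          \
static void insert_##name(struct tree_node **root, type *t)               \
{                                                                         \
    struct tree_node **p = root;                                          \
    while (*p) {                                                          \
        type *cur = name##_entry(*p);                                     \
        assert(t->keyfld != cur->keyfld); /* duplicates are a bug */      \
        p = t->keyfld < cur->keyfld ? &(*p)->left : &(*p)->right;         \
    }                                                                     \
    t->nodefld.left = t->nodefld.right = NULL;                            \
    *p = &t->nodefld;                                                     \
}                                                                         \
                                                                          \
static type *lookup_##name(struct tree_node *root, uint64_t key)          \
{                                                                         \
    while (root) {                                                        \
        type *cur = name##_entry(root);                                   \
        if (key < cur->keyfld)                                            \
            root = root->left;                                            \
        else if (key > cur->keyfld)                                       \
            root = root->right;                                           \
        else                                                              \
            return cur;                                                   \
    }                                                                     \
    return NULL;                                                          \
}                                                                         \
                                                                          \
static void erase_##name(struct tree_node **root, type *t)                \
{                                                                         \
    struct tree_node **p = root;                                          \
    struct tree_node *n = &t->nodefld;                                    \
    while (*p && *p != n) {                                               \
        type *cur = name##_entry(*p);                                     \
        p = t->keyfld < cur->keyfld ? &(*p)->left : &(*p)->right;         \
    }                                                                     \
    if (!*p)                                                              \
        return; /* not linked into this tree */                           \
    if (!n->left) {                                                       \
        *p = n->right;                                                    \
    } else if (!n->right) {                                               \
        *p = n->left;                                                     \
    } else {                                                              \
        /* Two children: splice in the leftmost node of the right side. */\
        struct tree_node **s = &n->right;                                 \
        while ((*s)->left)                                                \
            s = &(*s)->left;                                              \
        struct tree_node *succ = *s;                                      \
        *s = succ->right;                                                 \
        succ->left = n->left;                                             \
        succ->right = n->right;                                           \
        *p = succ;                                                        \
    }                                                                     \
    n->left = n->right = NULL;                                            \
}

/* Mirrors the struct from the commit message, with user-space types. */
struct foo {
    uint64_t id;
    struct tree_node bar_node;
};

/* Generates insert_bar(), erase_bar() and lookup_bar(). */
DEFINE_TREE_FUNCS(bar, struct foo, id, bar_node)
```

After the macro invocation, a caller keeps a `struct tree_node *root` initialised to `NULL`, links objects with `insert_bar(&root, &obj)`, finds them with `lookup_bar(root, id)` and unlinks them with `erase_bar(&root, &obj)`. The benefit of the macro approach is that each tree gets type-safe helpers without repeating the traversal code.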
    • libceph: open-code remove_{all,old}_osds() · 42a2c09f
      Ilya Dryomov committed
      They are called only once, from ceph_osdc_stop() and
      handle_osds_timeout() respectively.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: nuke unused fields and functions · 0c0a8de1
      Ilya Dryomov committed
      Either unused or useless:
      
          osdmap->mkfs_epoch
          osd->o_marked_for_keepalive
          monc->num_generic_requests
          osdc->map_waiters
          osdc->last_requested_map
          osdc->timeout_tid
      
          osd_req_op_cls_response_data()
      
          osdmap_apply_incremental() @msgr arg
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>