1. 31 5月, 2016 1 次提交
  2. 26 5月, 2016 29 次提交
    • Z
      ceph: make logical calculation functions return bool · 3b33f692
      Zhang Zhuoyu 提交于
      This patch makes serverl logical caculation functions return bool to
      improve readability due to these particular functions only using 0/1
      as their return value.
      
      No functional change.
      Signed-off-by: NZhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
      3b33f692
    • Y
      ceph: using hash value to compose dentry offset · f3c4ebe6
      Yan, Zheng 提交于
      If MDS sorts dentries in dirfrag in hash order, we use hash value to
      compose dentry offset. dentry offset is:
      
        (0xff << 52) | ((24 bits hash) << 28) |
        (the nth entry hash hash collision)
      
      This offset is stable across directory fragmentation. This alos means
      there is no need to reset readdir offset if directory get fragmented
      in the middle of readdir.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      f3c4ebe6
    • Y
      ceph: define 'end/complete' in readdir reply as bit flags · 956d39d6
      Yan, Zheng 提交于
      Set a flag in readdir request, which indicates that client interprets
      'end/complete' as bit flags. So that mds can reply additional flags in
      readdir reply.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      956d39d6
    • I
      737cc81e
    • I
      libceph: replace ceph_monc_request_next_osdmap() · 7cca78c9
      Ilya Dryomov 提交于
      ... with a wrapper around maybe_request_map() - no need for two
      osdmap-specific functions.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      7cca78c9
    • I
      libceph: pool deletion detection · 4609245e
      Ilya Dryomov 提交于
      This adds the "map check" infrastructure for sending osdmap version
      checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
      -ENOENT if the target pool doesn't exist or has just been deleted.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      4609245e
    • I
      libceph: async MON client generic requests · d0b19705
      Ilya Dryomov 提交于
      For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
      messages asynchronously and get a callback on completion.  Refactor MON
      client to allow firing off generic requests asynchronously and add an
      async variant of ceph_monc_get_version().  ceph_monc_do_statfs() is
      switched over and remains sync.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      d0b19705
    • I
      libceph: support for checking on status of watch · b07d3c4b
      Ilya Dryomov 提交于
      Implement ceph_osdc_watch_check() to be able to check on status of
      watch.  Note that the time it takes for a watch/notify event to get
      delivered through the notify_wq is taken into account.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      b07d3c4b
    • I
      libceph: support for sending notifies · 19079203
      Ilya Dryomov 提交于
      Implement ceph_osdc_notify() for sending notifies.
      
      Due to the fact that the current messenger can't do read-in into
      pagelists (it can only do write-out from them), I had to go with a page
      vector for a NOTIFY_COMPLETE payload, for now.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      19079203
    • I
      libceph, rbd: ceph_osd_linger_request, watch/notify v2 · 922dab61
      Ilya Dryomov 提交于
      This adds support and switches rbd to a new, more reliable version of
      watch/notify protocol.  As with the OSD client update, this is mostly
      about getting the right structures linked into the right places so that
      reconnects are properly sent when needed.  watch/notify v2 also
      requires sending regular pings to the OSDs - send_linger_ping().
      
      A major change from the old watch/notify implementation is the
      introduction of ceph_osd_linger_request - linger requests no longer
      piggy back on ceph_osd_request.  ceph_osd_event has been merged into
      ceph_osd_linger_request.
      
      All the details are now hidden within libceph, the interface consists
      of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
      ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
      the lifetime management simple.
      
      ceph_osdc_notify_ack() accepts an optional data payload, which is
      relayed back to the notifier.
      
      Portions of this patch are loosely based on work by Douglas Fuller
      <dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      922dab61
    • I
      libceph: a major OSD client update · 5aea3dcd
      Ilya Dryomov 提交于
      This is a major sync up, up to ~Jewel.  The highlights are:
      
      - per-session request trees (vs a global per-client tree)
      - per-session locking (vs a global per-client rwlock)
      - homeless OSD session
      - no ad-hoc global per-client lists
      - support for pool quotas
      - foundation for watch/notify v2 support
      - foundation for map check (pool deletion detection) support
      
      The switchover is incomplete: lingering requests can be setup and
      teared down but aren't ever reestablished.  This functionality is
      restored with the introduction of the new lingering infrastructure
      (ceph_osd_linger_request, linger_work, etc) in a later commit.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      5aea3dcd
    • I
      libceph: protect osdc->osd_lru list with a spinlock · 9dd2845c
      Ilya Dryomov 提交于
      OSD client is getting moved from the big per-client lock to a set of
      per-session locks.  The big rwlock would only be held for read most of
      the time, so a global osdc->osd_lru needs additional protection.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      9dd2845c
    • I
      libceph: handle_one_map() · 42c1b124
      Ilya Dryomov 提交于
      Separate osdmap handling from decoding and iterating over a bag of maps
      in a fresh MOSDMap message.  This sets up the scene for the updated OSD
      client.
      
      Of particular importance here is the addition of pi->was_full, which
      can be used to answer "did this pool go full -> not-full in this map?".
      This is the key bit for supporting pool quotas.
      
      We won't be able to downgrade map_sem for much longer, so drop
      downgrade_write().
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      42c1b124
    • I
      libceph: allocate dummy osdmap in ceph_osdc_init() · e5253a7b
      Ilya Dryomov 提交于
      This leads to a simpler osdmap handling code, particularly when dealing
      with pi->was_full, which is introduced in a later commit.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      e5253a7b
    • I
      libceph: redo callbacks and factor out MOSDOpReply decoding · fe5da05e
      Ilya Dryomov 提交于
      If you specify ACK | ONDISK and set ->r_unsafe_callback, both
      ->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
      very confusing.  Redo this so that only one of them is called:
      
          ->r_unsafe_callback(true), on ack
          ->r_unsafe_callback(false), on commit
      
      or
      
          ->r_callback, on ack|commit
      
      Decode everything in decode_MOSDOpReply() to reduce clutter.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      fe5da05e
    • I
      libceph: drop msg argument from ceph_osdc_callback_t · 85e084fe
      Ilya Dryomov 提交于
      finish_read(), its only user, uses it to get to hdr.data_len, which is
      what ->r_result is set to on success.  This gains us the ability to
      safely call callbacks from contexts other than reply, e.g. map check.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      85e084fe
    • I
      libceph: switch to calc_target(), part 2 · bb873b53
      Ilya Dryomov 提交于
      The crux of this is getting rid of ceph_osdc_build_request(), so that
      MOSDOp can be encoded not before but after calc_target() calculates the
      actual target.  Encoding now happens within ceph_osdc_start_request().
      
      Also nuked is the accompanying bunch of pointers into the encoded
      buffer that was used to update fields on each send - instead, the
      entire front is re-encoded.  If we want to support target->name_len !=
      base->name_len in the future, there is no other way, because oid is
      surrounded by other fields in the encoded buffer.
      
      Encoding OSD ops and adding data items to the request message were
      mixed together in osd_req_encode_op().  While we want to re-encode OSD
      ops, we don't want to add duplicate data items to the message when
      resending, so all call to ceph_osdc_msg_data_add() are factored out
      into a new setup_request_data().
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      bb873b53
    • I
      libceph: switch to calc_target(), part 1 · a66dd383
      Ilya Dryomov 提交于
      Replace __calc_request_pg() and most of __map_request() with
      calc_target() and start using req->r_t.
      
      ceph_osdc_build_request() however still encodes base_oid, because it's
      called before calc_target() is and target_oid is empty at that point in
      time; a printf in osdc_show() also shows base_oid.  This is fixed in
      "libceph: switch to calc_target(), part 2".
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      a66dd383
    • I
      libceph: introduce ceph_osd_request_target, calc_target() · 63244fa1
      Ilya Dryomov 提交于
      Introduce ceph_osd_request_target, containing all mapping-related
      fields of ceph_osd_request and calc_target() for calculating mappings
      and populating it.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      63244fa1
    • I
      libceph: pi->min_size, pi->last_force_request_resend · 04812acf
      Ilya Dryomov 提交于
      Add and decode pi->min_size and pi->last_force_request_resend.  These
      are going to be used by calc_target().
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      04812acf
    • I
      libceph: make pgid_cmp() global · f984cb76
      Ilya Dryomov 提交于
      calc_target() code is going to need to know how to compare PGs.  Take
      lhs and rhs pgid by const * while at it.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      f984cb76
    • I
      libceph: rename ceph_calc_pg_primary() · f81f1633
      Ilya Dryomov 提交于
      Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
      emphasise that it returns acting primary.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      f81f1633
    • I
      libceph: ceph_osds, ceph_pg_to_up_acting_osds() · 6f3bfd45
      Ilya Dryomov 提交于
      Knowning just acting set isn't enough, we need to be able to record up
      set as well to detect interval changes.  This means returning (up[],
      up_len, up_primary, acting[], acting_len, acting_primary) and passing
      it around.  Introduce and switch to ceph_osds to help with that.
      
      Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
      both up and acting sets from it.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      6f3bfd45
    • I
      libceph: rename ceph_oloc_oid_to_pg() · d9591f5e
      Ilya Dryomov 提交于
      Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
      that returned is raw PG and return -ENOENT instead of -EIO if the pool
      doesn't exist.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      d9591f5e
    • I
      libceph: fix ceph_eversion encoding · 985c1673
      Ilya Dryomov 提交于
      eversion_t is version+epoch in userspace and is encoded in that order.
      ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it
      in __send_request().  Reoder ceph_eversion fields.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      985c1673
    • I
      libceph: DEFINE_RB_FUNCS macro · fcd00b68
      Ilya Dryomov 提交于
      Given
      
          struct foo {
              u64 id;
              struct rb_node bar_node;
          };
      
      generate insert_bar(), erase_bar() and lookup_bar() functions with
      
          DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
      
      The key is assumed to be an integer (u64, int, etc), compared with
      < and >.  nodefld has to be initialized with RB_CLEAR_NODE().
      
      Start using it for MDS, MON and OSD requests and OSD sessions.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      fcd00b68
    • I
      libceph: nuke unused fields and functions · 0c0a8de1
      Ilya Dryomov 提交于
      Either unused or useless:
      
          osdmap->mkfs_epoch
          osd->o_marked_for_keepalive
          monc->num_generic_requests
          osdc->map_waiters
          osdc->last_requested_map
          osdc->timeout_tid
      
          osd_req_op_cls_response_data()
      
          osdmap_apply_incremental() @msgr arg
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      0c0a8de1
    • I
      libceph: variable-sized ceph_object_id · d30291b9
      Ilya Dryomov 提交于
      Currently ceph_object_id can hold object names of up to 100
      (CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
      expect one - long rbd image names:
      
      - a format 1 header is named "<imgname>.rbd"
      - an object that points to a format 2 header is named "rbd_id.<imgname>"
      
      We operate on these potentially long-named objects during rbd map, and,
      for format 1 images, during header refresh.  (A format 2 header name is
      a small system-generated string.)
      
      Lift this 100 character limit by making ceph_object_id be able to point
      to an externally-allocated string.  Apart from being able to work with
      almost arbitrarily-long named objects, this allows us to reduce the
      size of ceph_object_id from >100 bytes to 64 bytes.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      d30291b9
    • I
      libceph: move message allocation out of ceph_osdc_alloc_request() · 13d1ad16
      Ilya Dryomov 提交于
      The size of ->r_request and ->r_reply messages depends on the size of
      the object name (ceph_object_id), while the size of ceph_osd_request is
      fixed.  Move message allocation into a separate function that would
      have to be called after ceph_object_id and ceph_object_locator (which
      is also going to become variable in size with RADOS namespaces) have
      been filled in:
      
          req = ceph_osdc_alloc_request(...);
          <fill in req->r_base_oid>
          <fill in req->r_base_oloc>
          ceph_osdc_alloc_messages(req);
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      13d1ad16
  3. 26 4月, 2016 1 次提交
    • I
      libceph: make authorizer destruction independent of ceph_auth_client · 6c1ea260
      Ilya Dryomov 提交于
      Starting the kernel client with cephx disabled and then enabling cephx
      and restarting userspace daemons can result in a crash:
      
          [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
          [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262671.584334] PGD 0
          [262671.635847] Oops: 0000 [#1] SMP
          [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
          [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
          [262672.268937] Workqueue: ceph-msgr con_work [libceph]
          [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
          [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
          [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
          [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
          [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
          [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
          [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
          [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
          [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
          [262673.417556] Stack:
          [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
          [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
          [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
          [262673.805230] Call Trace:
          [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
          [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
          [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
          [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
          [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
          [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
          [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
          [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
          [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
          [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
          [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
          [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
          [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
          [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
          [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
          [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
          [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
          [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
      
      What happens is the following:
      
          (1) new MON session is established
          (2) old "none" ac is destroyed
          (3) new "cephx" ac is constructed
          ...
          (4) old OSD session (w/ "none" authorizer) is put
                ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)
      
      osd->o_auth.authorizer in the "none" case is just a bare pointer into
      ac, which contains a single static copy for all services.  By the time
      we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
      a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
      so we end up trying to destroy a "none" authorizer with a "cephx"
      destructor operating on invalid memory!
      
      To fix this, decouple authorizer destruction from ac and do away with
      a single static "none" authorizer by making a copy for each OSD or MDS
      session.  Authorizers themselves are independent of ac and so there is
      no reason for destroy_authorizer() to be an ac op.  Make it an op on
      the authorizer itself by turning ceph_authorizer into a real struct.
      
      Fixes: http://tracker.ceph.com/issues/15447Reported-by: NAlan Zhang <alan.zhang@linux.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      6c1ea260
  4. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  5. 26 3月, 2016 8 次提交
    • Y
      ceph: fix security xattr deadlock · 315f2408
      Yan, Zheng 提交于
      When security is enabled, security module can call filesystem's
      getxattr/setxattr callbacks during d_instantiate(). For cephfs,
      d_instantiate() is usually called by MDS' dispatch thread, while
      handling MDS reply. If the MDS reply does not include xattrs and
      corresponding caps, getxattr/setxattr need to send a new request
      to MDS and waits for the reply. This makes MDS' dispatch sleep,
      nobody handles later MDS replies.
      
      The fix is make sure lookup/atomic_open reply include xattrs and
      corresponding caps. So getxattr can be handled by cached xattrs.
      This requires some modification to both MDS and request message.
      (Client tells MDS what caps it wants; MDS encodes proper caps in
      the reply)
      
      Smack security module may call setxattr during d_instantiate().
      Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
      to us. So just make setxattr return error when called by MDS'
      dispatch thread.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      315f2408
    • Y
      libceph: add helper that duplicates last extent operation · 2c63f49a
      Yan, Zheng 提交于
      This helper duplicates last extent operation in OSD request, then
      adjusts the new extent operation's offset and length. The helper
      is for scatterd page writeback, which adds nonconsecutive dirty
      pages to single OSD request.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      2c63f49a
    • I
      libceph: enable large, variable-sized OSD requests · 3f1af42a
      Ilya Dryomov 提交于
      Turn r_ops into a flexible array member to enable large, consisting of
      up to 16 ops, OSD requests.  The use case is scattered writeback in
      cephfs and, as far as the kernel client is concerned, 16 is just a made
      up number.
      
      r_ops had size 3 for copyup+hint+write, but copyup is really a special
      case - it can only happen once.  ceph_osd_request_cache is therefore
      stuffed with num_ops=2 requests, anything bigger than that is allocated
      with kmalloc().  req_mempool is backed by ceph_osd_request_cache, which
      means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
      users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
      that.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      3f1af42a
    • Y
      libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op · 7665d85b
      Yan, Zheng 提交于
      This avoids defining large array of r_reply_op_{len,result} in
      in struct ceph_osd_request.
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      7665d85b
    • I
      libceph: rename ceph_osd_req_op::payload_len to indata_len · de2aa102
      Ilya Dryomov 提交于
      Follow userspace nomenclature on this - the next commit adds
      outdata_len.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      de2aa102
    • I
      libceph: monc hunt rate is 3s with backoff up to 30s · 168b9090
      Ilya Dryomov 提交于
      Unless we are in the process of setting up a client (i.e. connecting to
      the monitor cluster for the first time), apply a backoff: every time we
      want to reopen a session, increase our timeout by a multiple (currently
      2); when we complete the connection, reduce that multipler by 50%.
      
      Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      168b9090
    • I
      libceph: monc ping rate is 10s · 58d81b12
      Ilya Dryomov 提交于
      Split ping interval and ping timeout: ping interval is 10s; keepalive
      timeout is 30s.
      
      Make monc_ping_timeout a constant while at it - it's not actually
      exported as a mount option (and the rest of tick-related settings won't
      be either), so it's got no place in ceph_options.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      58d81b12
    • I
      libceph: revamp subs code, switch to SUBSCRIBE2 protocol · 82dcabad
      Ilya Dryomov 提交于
      It is currently hard-coded in the mon_client that mdsmap and monmap
      subs are continuous, while osdmap sub is always "onetime".  To better
      handle full clusters/pools in the osd_client, we need to be able to
      issue continuous osdmap subs.  Revamp subs code to allow us to specify
      for each sub whether it should be continuous or not.
      
      Although not strictly required for the above, switch to SUBSCRIBE2
      protocol while at it, eliminating the ambiguity between a request for
      "every map since X" and a request for "just the latest" when we don't
      have a map yet (i.e. have epoch 0).  SUBSCRIBE2 feature bit is now
      required - it's been supported since pre-argonaut (2010).
      
      Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
      in before we validate the epoch and successfully install the new map
      can mess up mon_client sub state.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      82dcabad