1. 26 5月, 2016 33 次提交
    • I
      libceph, rbd: ceph_osd_linger_request, watch/notify v2 · 922dab61
      Ilya Dryomov 提交于
      This adds support and switches rbd to a new, more reliable version of
      watch/notify protocol.  As with the OSD client update, this is mostly
      about getting the right structures linked into the right places so that
      reconnects are properly sent when needed.  watch/notify v2 also
      requires sending regular pings to the OSDs - send_linger_ping().
      
      A major change from the old watch/notify implementation is the
      introduction of ceph_osd_linger_request - linger requests no longer
      piggy back on ceph_osd_request.  ceph_osd_event has been merged into
      ceph_osd_linger_request.
      
      All the details are now hidden within libceph, the interface consists
      of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
      ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
      the lifetime management simple.
      
      ceph_osdc_notify_ack() accepts an optional data payload, which is
      relayed back to the notifier.
      
      Portions of this patch are loosely based on work by Douglas Fuller
      <dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      922dab61
    • I
      rbd: rbd_dev_header_unwatch_sync() variant · c525f036
      Ilya Dryomov 提交于
      Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
      callbacks.  This is for the new rados_watcherrcb_t, which would be
      called from a notify callback.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      c525f036
    • I
      libceph: wait_request_timeout() · 42b06965
      Ilya Dryomov 提交于
      The unwatch timeout is currently implemented in rbd.  With
      watch/unwatch code moving into libceph, we are going to need
      a ceph_osdc_wait_request() variant with a timeout.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      42b06965
    • I
      libceph: request_init() and request_release_checks() · 3540bfdb
      Ilya Dryomov 提交于
      These are going to be used by request_reinit() code.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      3540bfdb
    • I
      libceph: a major OSD client update · 5aea3dcd
      Ilya Dryomov 提交于
      This is a major sync up, up to ~Jewel.  The highlights are:
      
      - per-session request trees (vs a global per-client tree)
      - per-session locking (vs a global per-client rwlock)
      - homeless OSD session
      - no ad-hoc global per-client lists
      - support for pool quotas
      - foundation for watch/notify v2 support
      - foundation for map check (pool deletion detection) support
      
      The switchover is incomplete: lingering requests can be setup and
      teared down but aren't ever reestablished.  This functionality is
      restored with the introduction of the new lingering infrastructure
      (ceph_osd_linger_request, linger_work, etc) in a later commit.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      5aea3dcd
    • I
      libceph: protect osdc->osd_lru list with a spinlock · 9dd2845c
      Ilya Dryomov 提交于
      OSD client is getting moved from the big per-client lock to a set of
      per-session locks.  The big rwlock would only be held for read most of
      the time, so a global osdc->osd_lru needs additional protection.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      9dd2845c
    • I
      libceph: allocate ceph_osd with GFP_NOFAIL · 7a28f59b
      Ilya Dryomov 提交于
      create_osd() is called way too deep in the stack to be able to error
      out in a sane way; a failing create_osd() just messes everything up.
      The current req_notarget list solution is broken - the list is never
      traversed as it's not entirely clear when to do it, I guess.
      
      If we were to start traversing it at regular intervals and retrying
      each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
      so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
      with a better fix.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      7a28f59b
    • I
      libceph: osd_init() and osd_cleanup() · 0247a0cf
      Ilya Dryomov 提交于
      These are going to be used by homeless OSD sessions code.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      0247a0cf
    • I
      libceph: handle_one_map() · 42c1b124
      Ilya Dryomov 提交于
      Separate osdmap handling from decoding and iterating over a bag of maps
      in a fresh MOSDMap message.  This sets up the scene for the updated OSD
      client.
      
      Of particular importance here is the addition of pi->was_full, which
      can be used to answer "did this pool go full -> not-full in this map?".
      This is the key bit for supporting pool quotas.
      
      We won't be able to downgrade map_sem for much longer, so drop
      downgrade_write().
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      42c1b124
    • I
      libceph: allocate dummy osdmap in ceph_osdc_init() · e5253a7b
      Ilya Dryomov 提交于
      This leads to a simpler osdmap handling code, particularly when dealing
      with pi->was_full, which is introduced in a later commit.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      e5253a7b
    • I
      libceph: schedule tick from ceph_osdc_init() · fbca9635
      Ilya Dryomov 提交于
      Both homeless OSD sessions and watch/notify v2, introduced in later
      commits, require periodic ticks which don't depend on ->num_requests.
      Schedule the initial tick from ceph_osdc_init() and reschedule from
      handle_timeout() unconditionally.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      fbca9635
    • I
      libceph: move schedule_delayed_work() in ceph_osdc_init() · b37ee1b9
      Ilya Dryomov 提交于
      ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
      with handle_osds_timeout() running on invalid memory if any one of the
      allocations fails.  Call schedule_delayed_work() after everything is
      setup, just before returning.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      b37ee1b9
    • I
      libceph: redo callbacks and factor out MOSDOpReply decoding · fe5da05e
      Ilya Dryomov 提交于
      If you specify ACK | ONDISK and set ->r_unsafe_callback, both
      ->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
      very confusing.  Redo this so that only one of them is called:
      
          ->r_unsafe_callback(true), on ack
          ->r_unsafe_callback(false), on commit
      
      or
      
          ->r_callback, on ack|commit
      
      Decode everything in decode_MOSDOpReply() to reduce clutter.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      fe5da05e
    • I
      libceph: drop msg argument from ceph_osdc_callback_t · 85e084fe
      Ilya Dryomov 提交于
      finish_read(), its only user, uses it to get to hdr.data_len, which is
      what ->r_result is set to on success.  This gains us the ability to
      safely call callbacks from contexts other than reply, e.g. map check.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      85e084fe
    • I
      libceph: switch to calc_target(), part 2 · bb873b53
      Ilya Dryomov 提交于
      The crux of this is getting rid of ceph_osdc_build_request(), so that
      MOSDOp can be encoded not before but after calc_target() calculates the
      actual target.  Encoding now happens within ceph_osdc_start_request().
      
      Also nuked is the accompanying bunch of pointers into the encoded
      buffer that was used to update fields on each send - instead, the
      entire front is re-encoded.  If we want to support target->name_len !=
      base->name_len in the future, there is no other way, because oid is
      surrounded by other fields in the encoded buffer.
      
      Encoding OSD ops and adding data items to the request message were
      mixed together in osd_req_encode_op().  While we want to re-encode OSD
      ops, we don't want to add duplicate data items to the message when
      resending, so all call to ceph_osdc_msg_data_add() are factored out
      into a new setup_request_data().
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      bb873b53
    • I
      libceph: switch to calc_target(), part 1 · a66dd383
      Ilya Dryomov 提交于
      Replace __calc_request_pg() and most of __map_request() with
      calc_target() and start using req->r_t.
      
      ceph_osdc_build_request() however still encodes base_oid, because it's
      called before calc_target() is and target_oid is empty at that point in
      time; a printf in osdc_show() also shows base_oid.  This is fixed in
      "libceph: switch to calc_target(), part 2".
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      a66dd383
    • I
      libceph: introduce ceph_osd_request_target, calc_target() · 63244fa1
      Ilya Dryomov 提交于
      Introduce ceph_osd_request_target, containing all mapping-related
      fields of ceph_osd_request and calc_target() for calculating mappings
      and populating it.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      63244fa1
    • I
      libceph: pi->min_size, pi->last_force_request_resend · 04812acf
      Ilya Dryomov 提交于
      Add and decode pi->min_size and pi->last_force_request_resend.  These
      are going to be used by calc_target().
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      04812acf
    • I
      libceph: make pgid_cmp() global · f984cb76
      Ilya Dryomov 提交于
      calc_target() code is going to need to know how to compare PGs.  Take
      lhs and rhs pgid by const * while at it.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      f984cb76
    • I
      libceph: rename ceph_calc_pg_primary() · f81f1633
      Ilya Dryomov 提交于
      Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
      emphasise that it returns acting primary.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      f81f1633
    • I
      libceph: ceph_osds, ceph_pg_to_up_acting_osds() · 6f3bfd45
      Ilya Dryomov 提交于
      Knowning just acting set isn't enough, we need to be able to record up
      set as well to detect interval changes.  This means returning (up[],
      up_len, up_primary, acting[], acting_len, acting_primary) and passing
      it around.  Introduce and switch to ceph_osds to help with that.
      
      Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
      both up and acting sets from it.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      6f3bfd45
    • I
      libceph: rename ceph_oloc_oid_to_pg() · d9591f5e
      Ilya Dryomov 提交于
      Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
      that returned is raw PG and return -ENOENT instead of -EIO if the pool
      doesn't exist.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      d9591f5e
    • I
      libceph: fix ceph_eversion encoding · 985c1673
      Ilya Dryomov 提交于
      eversion_t is version+epoch in userspace and is encoded in that order.
      ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it
      in __send_request().  Reoder ceph_eversion fields.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      985c1673
    • I
      libceph: DEFINE_RB_FUNCS macro · fcd00b68
      Ilya Dryomov 提交于
      Given
      
          struct foo {
              u64 id;
              struct rb_node bar_node;
          };
      
      generate insert_bar(), erase_bar() and lookup_bar() functions with
      
          DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
      
      The key is assumed to be an integer (u64, int, etc), compared with
      < and >.  nodefld has to be initialized with RB_CLEAR_NODE().
      
      Start using it for MDS, MON and OSD requests and OSD sessions.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      fcd00b68
    • I
      libceph: open-code remove_{all,old}_osds() · 42a2c09f
      Ilya Dryomov 提交于
      They are called only once, from ceph_osdc_stop() and
      handle_osds_timeout() respectively.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      42a2c09f
    • I
      libceph: nuke unused fields and functions · 0c0a8de1
      Ilya Dryomov 提交于
      Either unused or useless:
      
          osdmap->mkfs_epoch
          osd->o_marked_for_keepalive
          monc->num_generic_requests
          osdc->map_waiters
          osdc->last_requested_map
          osdc->timeout_tid
      
          osd_req_op_cls_response_data()
      
          osdmap_apply_incremental() @msgr arg
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      0c0a8de1
    • I
      rbd: use header_oid instead of header_name · c41d13a3
      Ilya Dryomov 提交于
      Switch to ceph_object_id and use ceph_oid_aprintf() instead of a bare
      const char *.  This reduces noise in rbd_dev_header_name().
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      c41d13a3
    • I
      libceph: variable-sized ceph_object_id · d30291b9
      Ilya Dryomov 提交于
      Currently ceph_object_id can hold object names of up to 100
      (CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
      expect one - long rbd image names:
      
      - a format 1 header is named "<imgname>.rbd"
      - an object that points to a format 2 header is named "rbd_id.<imgname>"
      
      We operate on these potentially long-named objects during rbd map, and,
      for format 1 images, during header refresh.  (A format 2 header name is
      a small system-generated string.)
      
      Lift this 100 character limit by making ceph_object_id be able to point
      to an externally-allocated string.  Apart from being able to work with
      almost arbitrarily-long named objects, this allows us to reduce the
      size of ceph_object_id from >100 bytes to 64 bytes.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      d30291b9
    • I
      libceph: change how osd_op_reply message size is calculated · 711da55d
      Ilya Dryomov 提交于
      For a message pool message, preallocate a page, just like we do for
      osd_op.  For a normal message, take ceph_object_id into account and
      don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      711da55d
    • I
      libceph: move message allocation out of ceph_osdc_alloc_request() · 13d1ad16
      Ilya Dryomov 提交于
      The size of ->r_request and ->r_reply messages depends on the size of
      the object name (ceph_object_id), while the size of ceph_osd_request is
      fixed.  Move message allocation into a separate function that would
      have to be called after ceph_object_id and ceph_object_locator (which
      is also going to become variable in size with RADOS namespaces) have
      been filled in:
      
          req = ceph_osdc_alloc_request(...);
          <fill in req->r_base_oid>
          <fill in req->r_base_oloc>
          ceph_osdc_alloc_messages(req);
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      13d1ad16
    • I
      libceph: grab snapc in ceph_osdc_alloc_request() · 84127282
      Ilya Dryomov 提交于
      ceph_osdc_build_request() is going away.  Grab snapc and initialize
      ->r_snapid in ceph_osdc_alloc_request().
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      84127282
    • I
      libceph: make ceph_osdc_put_request() accept NULL · 3ed97d63
      Ilya Dryomov 提交于
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      3ed97d63
    • I
      rbd: get/put img_request in rbd_img_request_submit() · 663ae2cc
      Ilya Dryomov 提交于
      By the time we get to checking for_each_obj_request_safe(img_request)
      terminating condition, all obj_requests may be complete and img_request
      ref, that rbd_img_request_submit() takes away from its caller, may be
      put.  Moving the next_obj_request cursor is then a use-after-free on
      img_request.
      
      It's totally benign, as the value that's read is never used, but
      I think it's still worth fixing.
      
      Cc: Alex Elder <elder@linaro.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      663ae2cc
  2. 16 5月, 2016 1 次提交
  3. 15 5月, 2016 6 次提交
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5f95063c
      Linus Torvalds 提交于
      Pull x86 fix from Thomas Gleixner:
       "Just the missing compat entry for the new pread/writev2"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86: Use compat version for preadv2 and pwritev2
      5f95063c
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 272911b8
      Linus Torvalds 提交于
      Pull networking fixes from David Miller:
      
       1) Fix mvneta/bm dependencies, from Arnd Bergmann.
      
       2) RX completion hw bug workaround in bnxt_en, from Michael Chan.
      
       3) Kernel pointer leak in nf_conntrack, from Linus.
      
       4) Hoplimit route attribute limits not enforced properly, from Paolo
          Abeni.
      
       5) qlcnic driver NULL deref fix from Dan Carpenter.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        arm64: bpf: jit JMP_JSET_{X,K}
        net/route: enforce hoplimit max value
        nf_conntrack: avoid kernel pointer value leak in slab name
        drivers: net: xgene: fix register offset
        drivers: net: xgene: fix statistics counters race condition
        drivers: net: xgene: fix ununiform latency across queues
        drivers: net: xgene: fix sharing of irqs
        drivers: net: xgene: fix IPv4 forward crash
        xen-netback: fix extra_info handling in xenvif_tx_err()
        net: mvneta: bm: fix dependencies again
        bnxt_en: Add workaround to detect bad opaque in rx completion (part 2)
        bnxt_en: Add workaround to detect bad opaque in rx completion (part 1)
        qlcnic: potential NULL dereference in qlcnic_83xx_get_minidump_template()
      272911b8
    • Z
      arm64: bpf: jit JMP_JSET_{X,K} · 98397fc5
      Zi Shen Lim 提交于
      Original implementation commit e54bcde3 ("arm64: eBPF JIT compiler")
      had the relevant code paths, but due to an oversight always fail jiting.
      
      As a result, we had been falling back to BPF interpreter whenever a BPF
      program has JMP_JSET_{X,K} instructions.
      
      With this fix, we confirm that the corresponding tests in lib/test_bpf
      continue to pass, and also jited.
      
      ...
      [    2.784553] test_bpf: #30 JSET jited:1 188 192 197 PASS
      [    2.791373] test_bpf: #31 tcpdump port 22 jited:1 325 677 625 PASS
      [    2.808800] test_bpf: #32 tcpdump complex jited:1 323 731 991 PASS
      ...
      [    3.190759] test_bpf: #237 JMP_JSET_K: if (0x3 & 0x2) return 1 jited:1 110 PASS
      [    3.192524] test_bpf: #238 JMP_JSET_K: if (0x3 & 0xffffffff) return 1 jited:1 98 PASS
      [    3.211014] test_bpf: #249 JMP_JSET_X: if (0x3 & 0x2) return 1 jited:1 120 PASS
      [    3.212973] test_bpf: #250 JMP_JSET_X: if (0x3 & 0xffffffff) return 1 jited:1 89 PASS
      ...
      
      Fixes: e54bcde3 ("arm64: eBPF JIT compiler")
      Signed-off-by: NZi Shen Lim <zlim.lnx@gmail.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Acked-by: NYang Shi <yang.shi@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98397fc5
    • P
      net/route: enforce hoplimit max value · 626abd59
      Paolo Abeni 提交于
      Currently, when creating or updating a route, no check is performed
      in both ipv4 and ipv6 code to the hoplimit value.
      
      The caller can i.e. set hoplimit to 256, and when such route will
       be used, packets will be sent with hoplimit/ttl equal to 0.
      
      This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4
      ipv6 route code, substituting any value greater than 255 with 255.
      
      This is consistent with what is currently done for ADVMSS and MTU
      in the ipv4 code.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      626abd59
    • L
      nf_conntrack: avoid kernel pointer value leak in slab name · 31b0b385
      Linus Torvalds 提交于
      The slab name ends up being visible in the directory structure under
      /sys, and even if you don't have access rights to the file you can see
      the filenames.
      
      Just use a 64-bit counter instead of the pointer to the 'net' structure
      to generate a unique name.
      
      This code will go away in 4.7 when the conntrack code moves to a single
      kmemcache, but this is the backportable simple solution to avoiding
      leaking kernel pointers to user space.
      
      Fixes: 5b3501fa ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31b0b385
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 6ba5b85f
      Linus Torvalds 提交于
      Pull vfs fixes from Al Viro:
       "Overlayfs fixes from Miklos, assorted fixes from me.
      
        Stable fodder of varying severity, all sat in -next for a while"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        ovl: ignore permissions on underlying lookup
        vfs: add lookup_hash() helper
        vfs: rename: check backing inode being equal
        vfs: add vfs_select_inode() helper
        get_rock_ridge_filename(): handle malformed NM entries
        ecryptfs: fix handling of directory opening
        atomic_open(): fix the handling of create_error
        fix the copy vs. map logics in blk_rq_map_user_iov()
        do_splice_to(): cap the size before passing to ->splice_read()
      6ba5b85f