1. 12 Feb, 2019 3 commits
  2. 11 Feb, 2019 9 commits
    • A
      Merge branch 'skb_sk-sk_fullsock-tcp_sock' · d105fa98
      Committed by Alexei Starovoitov
      Martin KaFai Lau says:
      
      ====================
      This series adds __sk_buff->sk, "struct bpf_tcp_sock",
      BPF_FUNC_sk_fullsock and BPF_FUNC_tcp_sock.  Together, they provide
      a common way to expose the members of "struct tcp_sock" and
      "struct bpf_sock" for the bpf_prog to access.
      
      The patch series first adds a bpf_sock pointer to __sk_buff
      and a new helper BPF_FUNC_sk_fullsock.
      
      It then adds BPF_FUNC_tcp_sock to get a bpf_tcp_sock
      pointer from a bpf_sock pointer.
      
      The current use case is to allow a cg_skb_bpf_prog to provide
      per cgroup traffic policing/shaping.
      
      Please see individual patch for details.
      
      v2:
      - Patch 1 depends on
        commit d6238766 ("bpf: Fix narrow load on a bpf_sock returned from sk_lookup()")
        in the bpf branch.
      - Add sk_to_full_sk() to bpf_sk_fullsock() and bpf_tcp_sock()
        such that there is a way to access the listener's sk and tcp_sk
        when __sk_buff->sk is a request_sock.
        The comments in the uapi bpf.h are updated accordingly.
      - bpf_ctx_range_till() is used in bpf_sock_common_is_valid_access()
        in patch 1.  Saved a few lines.
      - Patch 2 is new in v2 and it adds "state", "dst_ip4", "dst_ip6" and
        "dst_port" to the bpf_sock.  Narrow load is allowed on them.
        The "state" (i.e. sk_state) has already been used in
        INET_DIAG (e.g. ss -t) and getsockopt(TCP_INFO).
      - While at it in the new patch 2, also allow narrow load on some
        existing fields of the bpf_sock, which are "family", "type", "protocol"
        and "src_port".  Only allow loading from first byte for now.
        i.e. does not allow narrow load starting from the 2nd byte.
      - Add some narrow load tests to the test_verifier's sock.c
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d105fa98
    • M
      bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock · e0b27b3f
      Committed by Martin KaFai Lau
      This patch adds a C program to show the usage of
      skb->sk and bpf_tcp_sock.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e0b27b3f
    • M
      bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to test_verifer · fb47d1d9
      Committed by Martin KaFai Lau
      This patch tests accessing the skb->sk and the new helpers,
      bpf_sk_fullsock and bpf_tcp_sock.
      
      The errstr of some existing "reference tracking" tests is changed
      with s/bpf_sock/sock/ and s/socket/sock/ where "sock" is from the
      verifier's reg_type_str[].
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fb47d1d9
    • M
      bpf: Sync bpf.h to tools/ · 281f9e75
      Committed by Martin KaFai Lau
      This patch syncs the uapi bpf.h to tools/.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      281f9e75
    • M
      bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock · 655a51e5
      Committed by Martin KaFai Lau
      This patch adds a helper function BPF_FUNC_tcp_sock and it
      is currently available for cg_skb and sched_(cls|act):
      
      struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);
      
      int cg_skb_foo(struct __sk_buff *skb) {
      	struct bpf_tcp_sock *tp;
      	struct bpf_sock *sk;
      	__u32 snd_cwnd;
      
      	sk = skb->sk;
      	if (!sk)
      		return 1;
      
      	tp = bpf_tcp_sock(sk);
      	if (!tp)
      		return 1;
      
      	snd_cwnd = tp->snd_cwnd;
      	/* ... */
      
      	return 1;
      }
      
      A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
      read-only access.  bpf_tcp_sock has all the existing tcp_sock fields
      that have already been exposed by the bpf_sock_ops.
      i.e. no new tcp_sock's fields are exposed in bpf.h.
      
      This helper returns a pointer to the tcp_sock.  If it is not a tcp_sock
      or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it
      returns NULL.  Hence, the caller needs to check for NULL before
      accessing it.
      
      The current use case is to expose members from tcp_sock
      to allow a cg_skb_bpf_prog to provide per cgroup traffic
      policing/shaping.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      655a51e5
    • M
      bpf: Refactor sock_ops_convert_ctx_access · 9b1f3d6e
      Committed by Martin KaFai Lau
      The next patch will introduce a new "struct bpf_tcp_sock" which
      exposes the same tcp_sock's fields already exposed in
      "struct bpf_sock_ops".
      
      This patch refactors the existing convert_ctx_access() code for
      "struct bpf_sock_ops" to get them ready to be reused for
      "struct bpf_tcp_sock".  The "rtt_min" is not refactored
      in this patch because its handling is different from other
      fields.
      
      The SOCK_OPS_GET_TCP_SOCK_FIELD is new. All other SOCK_OPS_XXX_FIELD
      changes are code move only.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9b1f3d6e
    • M
      bpf: Add state, dst_ip4, dst_ip6 and dst_port to bpf_sock · aa65d696
      Committed by Martin KaFai Lau
      This patch adds "state", "dst_ip4", "dst_ip6" and "dst_port" to the
      bpf_sock.  The userspace has already been using "state",
      e.g. inet_diag (ss -t) and getsockopt(TCP_INFO).
      
      This patch also allows narrow load on the following existing fields:
      "family", "type", "protocol" and "src_port".  Unlike IP address,
      the load offset is restricted to the first byte for them but it
      can be relaxed later if there is a use case.
      
      This patch also folds __sock_filter_check_size() into
      bpf_sock_is_valid_access() since it is not called
      from anywhere else.  All bpf_sock checking is in
      one place.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      aa65d696
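The narrow-load rule described in the commit above (any aligned sub-range for IP addresses, first byte only for "family", "type", "protocol" and "src_port") can be sketched as a small userspace model. This is not the kernel's bpf_sock_is_valid_access(); `struct field_rule`, `narrow_load_ok()` and the `family` offset are assumptions made for illustration. The `src_ip4` offset of 24 follows from the `r0 +26` load for `offsetof(struct bpf_sock, src_ip4) + 2` in the verifier log of the sk_lookup() fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one context field; not a kernel structure. */
struct field_rule {
	size_t off;       /* offset of the field inside the struct */
	size_t size;      /* full size of the field in bytes */
	bool any_offset;  /* narrow loads may start at any aligned byte */
};

/* Return true if a load of `size` bytes at `off` is acceptable for `f`. */
static bool narrow_load_ok(const struct field_rule *f, size_t off, size_t size)
{
	if (size != 1 && size != 2 && size != 4)
		return false;   /* this model only covers 1-, 2- and 4-byte loads */
	if (off < f->off || off + size > f->off + f->size)
		return false;   /* the load must stay inside the field */
	if (off % size)
		return false;   /* the load must be naturally aligned */
	if (size == f->size)
		return true;    /* full-width load */
	/* Narrow load: either any offset is fine, or only the first byte. */
	return f->any_offset || off == f->off;
}

/* Illustrative rules; the family offset is an assumption of this sketch. */
static const struct field_rule family_rule  = { 4, 4, false };
static const struct field_rule src_ip4_rule = { 24, 4, true };

static void narrow_load_example(void)
{
	assert(narrow_load_ok(&family_rule, 4, 1));    /* starts at first byte */
	assert(!narrow_load_ok(&family_rule, 5, 1));   /* starts at 2nd byte */
	assert(narrow_load_ok(&src_ip4_rule, 26, 2));  /* upper half of src_ip4 */
}
```

Keeping the "first byte only" restriction as a per-field flag mirrors the commit's remark that the restriction can be relaxed later if a use case appears.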
    • M
      bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper · 46f8bc92
      Committed by Martin KaFai Lau
      In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
      before accessing the fields in sock.  For example, in __netdev_pick_tx:
      
      static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
      			    struct net_device *sb_dev)
      {
      	/* ... */
      
      	struct sock *sk = skb->sk;
      
      		if (queue_index != new_index && sk &&
      		    sk_fullsock(sk) &&
      		    rcu_access_pointer(sk->sk_dst_cache))
      			sk_tx_queue_set(sk, new_index);
      
      	/* ... */
      
      	return queue_index;
      }
      
      This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
      where a few of the convert_ctx_access() in filter.c have already been
      accessing the skb->sk sock_common's fields,
      e.g. sock_ops_convert_ctx_access().
      
      "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
      Some of the fields in "bpf_sock" will not be directly
      accessible through the "__sk_buff->sk" pointer.  It is limited
      by the new "bpf_sock_common_is_valid_access()".
      e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
           are not allowed.
      
      The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
      can be used to get a sk with all accessible fields in "bpf_sock".
      This helper is added to both cg_skb and sched_(cls|act).
      
      int cg_skb_foo(struct __sk_buff *skb) {
      	struct bpf_sock *sk;
      
      	sk = skb->sk;
      	if (!sk)
      		return 1;
      
      	sk = bpf_sk_fullsock(sk);
      	if (!sk)
      		return 1;
      
      	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
      		return 1;
      
      	/* some_traffic_shaping(); */
      
      	return 1;
      }
      
      (1) The sk is read only
      
      (2) There is no new "struct bpf_sock_common" introduced.
      
      (3) Future kernel sock's members could be added to bpf_sock only
          instead of repeatedly adding at multiple places like currently
          in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
      
      (4) After "sk = skb->sk", the reg holding sk is in type
          PTR_TO_SOCK_COMMON_OR_NULL.
      
      (5) After bpf_sk_fullsock(), the return type will be in type
          PTR_TO_SOCKET_OR_NULL which is the same as the return type of
          bpf_sk_lookup_xxx().
      
          However, bpf_sk_fullsock() does not take refcnt.  The
          acquire_reference_state() is only depending on the return type now.
          To avoid it, a new is_acquire_function() is checked before calling
          acquire_reference_state().
      
      (6) The WARN_ON in "release_reference_state()" is no longer an
          internal verifier bug.
      
          When reg->id is not found in state->refs[], it means the
          bpf_prog does something wrong like
          "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where reference has
          never been acquired by calling "bpf_sk_fullsock(skb->sk)".
      
          -EINVAL is returned and a verbose message is logged instead of WARN_ON.  A test is
          added to the test_verifier in a later patch.
      
          Since the WARN_ON in "release_reference_state()" is no longer
          needed, "__release_reference_state()" is folded into
          "release_reference_state()" also.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      46f8bc92
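Points (5) and (6) of the commit above can be illustrated with a toy reference tracker. This is a simplified userspace model, not the verifier's code: `struct ref_state`, `enum helper`, `call_helper()` and `release_ref()` are hypothetical names. It shows why a reference is recorded only for acquiring helpers, and why releasing an id that was never acquired is reported as -EINVAL rather than treated as an internal bug.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_REFS 8

/* Toy stand-in for the verifier's list of acquired reference ids. */
struct ref_state {
	int ids[MAX_REFS];
	int n;
};

enum helper {
	HELPER_SK_LOOKUP_TCP,  /* takes a reference on the returned sock */
	HELPER_SK_FULLSOCK,    /* same return type, but takes no reference */
};

/* Only some helpers acquire a reference, regardless of return type. */
static bool is_acquire(enum helper h)
{
	return h == HELPER_SK_LOOKUP_TCP;
}

/* Model a helper call that returns a pointer tagged with `id`. */
static int call_helper(struct ref_state *s, enum helper h, int id)
{
	if (!is_acquire(h))
		return 0;               /* nothing to track */
	if (s->n == MAX_REFS)
		return -ENOMEM;
	s->ids[s->n++] = id;
	return 0;
}

/* Model a release: a program error if `id` was never acquired. */
static int release_ref(struct ref_state *s, int id)
{
	for (int i = 0; i < s->n; i++) {
		if (s->ids[i] == id) {
			s->ids[i] = s->ids[--s->n];  /* swap-remove */
			return 0;
		}
	}
	return -EINVAL;
}

static void ref_tracking_example(void)
{
	struct ref_state s = { .n = 0 };

	/* Shaped like bpf_sk_release(bpf_sk_fullsock(skb->sk)): rejected. */
	assert(call_helper(&s, HELPER_SK_FULLSOCK, 1) == 0);
	assert(release_ref(&s, 1) == -EINVAL);

	/* Lookup followed by release: accepted. */
	assert(call_helper(&s, HELPER_SK_LOOKUP_TCP, 2) == 0);
	assert(release_ref(&s, 2) == 0);
}
```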
    • M
      bpf: Fix narrow load on a bpf_sock returned from sk_lookup() · 5f456649
      Committed by Martin KaFai Lau
      By adding this test to test_verifier:
      {
      	"reference tracking: access sk->src_ip4 (narrow load)",
      	.insns = {
      	BPF_SK_LOOKUP,
      	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
      	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
      	BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, src_ip4) + 2),
      	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
      	BPF_EMIT_CALL(BPF_FUNC_sk_release),
      	BPF_EXIT_INSN(),
      	},
      	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
      	.result = ACCEPT,
      },
      
      The above test loads 2 bytes from sk->src_ip4 where
      sk is obtained by bpf_sk_lookup_tcp().
      
      It hits an internal verifier error from convert_ctx_accesses():
      [root@arch-fb-vm1 bpf]# ./test_verifier 665 665
      Failed to load prog 'Invalid argument'!
      0: (b7) r2 = 0
      1: (63) *(u32 *)(r10 -8) = r2
      2: (7b) *(u64 *)(r10 -16) = r2
      3: (7b) *(u64 *)(r10 -24) = r2
      4: (7b) *(u64 *)(r10 -32) = r2
      5: (7b) *(u64 *)(r10 -40) = r2
      6: (7b) *(u64 *)(r10 -48) = r2
      7: (bf) r2 = r10
      8: (07) r2 += -48
      9: (b7) r3 = 36
      10: (b7) r4 = 0
      11: (b7) r5 = 0
      12: (85) call bpf_sk_lookup_tcp#84
      13: (bf) r6 = r0
      14: (15) if r0 == 0x0 goto pc+3
       R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1
      15: (69) r2 = *(u16 *)(r0 +26)
      16: (bf) r1 = r6
      17: (85) call bpf_sk_release#86
      18: (95) exit
      
      from 14 to 18: safe
      processed 20 insns (limit 131072), stack depth 48
      bpf verifier is misconfigured
      Summary: 0 PASSED, 0 SKIPPED, 1 FAILED
      
      The bpf_sock_is_valid_access() is expecting src_ip4 can be narrowly
      loaded (meaning load any 1 or 2 bytes of the src_ip4) by
      marking info->ctx_field_size.  However, this marked
      ctx_field_size is not used.  This patch fixes it.
      
      Due to the recent refactoring in test_verifier,
      this new test will be added to the bpf-next branch
      (together with the bpf_tcp_sock patchset)
      to avoid merge conflict.
      
      Fixes: c64b7983 ("bpf: Add PTR_TO_SOCKET verifier type")
      Cc: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5f456649
  3. 09 Feb, 2019 5 commits
    • A
      Merge branch 'btf-api-extensions' · 28bbfc3a
      Committed by Alexei Starovoitov
      Andrii Nakryiko says:
      
      ====================
      This patchset introduces a set of new APIs that make it possible to work with BTF
      more effectively (and without involving kernel) for applications like pahole that
      need to manipulate .BTF and .BTF.ext data.
      
      Patch #1 changes existing btf__new() API call to only load and initialize
      struct btf, while exposing new btf__load() API to attempt to load and validate
      BTF in kernel.
      
      Patch #2 adds btf__get_raw_data() API allowing to get access to raw BTF data from
      struct btf.
      
      Patch #3 adds similar btf_ext__get_raw_data() API for working with struct btf_ext.
      
      Patch #4 removes not-yet-stable btf__get_strings() API which was added to be able
      to test contents of struct btf for btf__dedup(). It's now superseded by raw APIs.
      
      v3->v4:
      - formatting fixes
      - renamed btf_ext functions/structs to use "setup" language instead of "copy"
      - removed btf__get_strings from libbpf.map
      
      v2->v3:
      - const void* variants of btf__get_raw_data()
      - added btf_ext__get_raw_data()
      - removed btf__get_strings() and adapted test_btf.c to use btf__get_raw_data()
      
      v1->v2:
      - btf_load() returns just error, not fd
      - fix ordering in libbpf.map
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      28bbfc3a
    • A
      tools/bpf: remove btf__get_strings() superseded by raw data API · 49b57e0d
      Committed by Andrii Nakryiko
      Now that we have btf__get_raw_data() it's trivial for tests to iterate
      over all strings for testing purposes, which eliminates the need for
      btf__get_strings() API.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      49b57e0d
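The claim in the commit above that iterating over all strings is trivial once the raw data is available can be sketched as follows. This is a minimal sketch assuming a host-endian raw blob such as the one btf__get_raw_data() hands back. `struct raw_btf_header` is a local mirror of the `struct btf_header` layout from the uapi linux/btf.h, and `count_btf_strings()` is a hypothetical helper, not a libbpf function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BTF_MAGIC 0xeB9F

/* Local mirror of struct btf_header (uapi linux/btf.h). */
struct raw_btf_header {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
	uint32_t hdr_len;
	uint32_t type_off;  /* offsets are relative to the end of the header */
	uint32_t type_len;
	uint32_t str_off;
	uint32_t str_len;
};

/* Count NUL-terminated strings in the string section, -1 if malformed. */
static int count_btf_strings(const void *data, size_t size)
{
	struct raw_btf_header hdr;
	const char *s, *end;
	size_t avail;
	int n = 0;

	if (size < sizeof(hdr))
		return -1;
	memcpy(&hdr, data, sizeof(hdr));
	if (hdr.magic != BTF_MAGIC || hdr.hdr_len < sizeof(hdr) ||
	    hdr.hdr_len > size)
		return -1;

	avail = size - hdr.hdr_len;
	if (hdr.str_off > avail || hdr.str_len > avail - hdr.str_off)
		return -1;
	if (hdr.str_len == 0)
		return 0;

	s = (const char *)data + hdr.hdr_len + hdr.str_off;
	end = s + hdr.str_len;
	if (end[-1] != '\0')
		return -1;          /* section must end with a terminator */

	while (s < end) {
		s += strlen(s) + 1; /* step over one string and its NUL */
		n++;
	}
	return n;
}

/* Build a tiny blob with strings "", "int" and "char" and count them. */
static int btf_strings_example(void)
{
	static const char strs[] = "\0int\0char";
	struct raw_btf_header hdr = {
		.magic = BTF_MAGIC,
		.version = 1,
		.hdr_len = sizeof(struct raw_btf_header),
		.str_len = sizeof(strs),
	};
	unsigned char blob[sizeof(struct raw_btf_header) + sizeof(strs)];

	memcpy(blob, &hdr, sizeof(hdr));
	memcpy(blob + sizeof(hdr), strs, sizeof(strs));
	return count_btf_strings(blob, sizeof(blob));
}
```

The string section conventionally starts with an empty string at offset 0, which is why the example blob yields three strings rather than two.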
    • A
      btf: expose API to work with raw btf_ext data · ae4ab4b4
      Committed by Andrii Nakryiko
      This patch changes struct btf_ext to retain original data in a sequential
      block of memory, which makes it possible to expose
      btf_ext__get_raw_data() interface similar to btf__get_raw_data(), allowing
      users of libbpf to get access to raw representation of .BTF.ext section.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ae4ab4b4
    • A
      btf: expose API to work with raw btf data · 02c87446
      Committed by Andrii Nakryiko
      This patch exposes a new API, btf__get_raw_data(), that allows getting a copy
      of raw BTF data out of struct btf. This is useful for external programs
      that need to manipulate raw data, e.g., pahole using btf__dedup() to
      deduplicate BTF type info and then writing it back to file.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      02c87446
    • A
      btf: separate btf creation and loading · d29d87f7
      Committed by Andrii Nakryiko
      This change splits out previous btf__new functionality of constructing
      struct btf and loading it into kernel into two:
      - btf__new() just creates and initializes struct btf
      - btf__load() attempts to load existing struct btf into kernel
      
      btf__free will still close BTF fd, if it was ever loaded successfully
      into kernel.
      
      This change allows users of libbpf to manipulate BTF using its API,
      without the need to unnecessarily load it into kernel.
      
      One of the intended use cases is pahole, which will do DWARF to BTF
      conversion and then use libbpf to do type deduplication, while then
      handling ELF sections overwriting and other concerns on its own.
      
      Fixes: 2d3feca8 ("bpf: btf: print map dump and lookup with btf info")
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d29d87f7
  4. 08 Feb, 2019 23 commits
    • Y
      tools/bpf: add log_level to bpf_load_program_attr · a4021a35
      Committed by Yonghong Song
      The kernel verifier has three levels of logs:
          0: no logs
          1: logs mostly useful
        > 1: verbose
      
      Current libbpf API functions bpf_load_program_xattr() and
      bpf_load_program() cannot specify log_level.
      The bcc, however, provides an interface for user to
      specify log_level 2 for verbose output.
      
      This patch added log_level into structure
      bpf_load_program_attr, so users, including bcc, can use
      bpf_load_program_xattr() to change log_level. The
      supported log_level is 0, 1, and 2.
      
      The bpf selftest test_sock.c is modified to enable log_level = 2.
      If the "verbose" in test_sock.c is changed to true,
      the test will output logs like below:
        $ ./test_sock
        func#0 @0
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        0: (bf) r6 = r1
        1: R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        1: (61) r7 = *(u32 *)(r6 +28)
        invalid bpf_context access off=28 size=4
      
        Test case: bind4 load with invalid access: src_ip6 .. [PASS]
        ...
        Test case: bind6 allow all .. [PASS]
        Summary: 16 PASSED, 0 FAILED
      
      Some test_sock tests are negative tests and verbose verifier
      log will be printed out as shown in the above.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a4021a35
    • A
      tools/bpf: add missing strings.h include · 62b8cea6
      Committed by Andrii Nakryiko
      A few files in libbpf use the bzero() function (defined in the strings.h header) but
      don't include the corresponding header. When libbpf is added as a dependency to pahole,
      this nondeterministically causes warnings on some machines:
      
      bpf.c:225:2: warning: implicit declaration of function 'bzero' [-Wimplicit-function-declaration]
        bzero(&attr, sizeof(attr));
          ^~~~~
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      62b8cea6
    • M
      net: fixed-phy: Add fixed_phy_register_with_gpiod() API · 71bd106d
      Committed by Moritz Fischer
      Add fixed_phy_register_with_gpiod() API. It lets users create a
      fixed_phy instance that uses a GPIO descriptor which was obtained
      externally, e.g. through platform data.
      This enables platform devices (non-DT based) to use GPIOs for link
      status.
      Signed-off-by: Moritz Fischer <mdf@kernel.org>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71bd106d
    • D
      Merge branch 'Add-comphy-support-for-Armada-38x' · a4751093
      Committed by David S. Miller
      Russell King says:
      
      ====================
      Add comphy support for Armada 38x
      
      This series adds support for the comphy for Armada 38x, which allows
      these SoCs to use 2500BASE-X mode with appropriate SFP modules.
      
      Tested on SolidRun Clearfog after updating for the 5.0 merge window
      changes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4751093
    • R
      ARM: dts: clearfog: add comphy settings for Ethernet interfaces · f548ced1
      Committed by Russell King
      Add the comphy settings for the Ethernet interfaces.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f548ced1
    • R
      net: marvell: neta: add comphy support · a10c1c81
      Committed by Russell King
      Add support for the common phy binding, so that we can reconfigure the
      comphy according to the desired ethernet speed.  This will allow us to
      support 1000base-X and 2500base-X SFPs dynamically on SolidRun Clearfog.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a10c1c81
    • R
      dt-bindings: net: mvneta: add phys property · 4ca124f4
      Committed by Russell King
      Add an optional phys property to the mvneta binding documentation for
      the common phy.
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ca124f4
    • R
      ARM: dts: add description for Armada 38x common phy · f3a6a9f3
      Committed by Russell King
      Add the DT description for the Armada 38x common phy.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3a6a9f3
    • R
      phy: armada38x: add common phy support · 14dc100b
      Committed by Russell King
      Add support for the Armada 38x common phy to allow us to change the
      speed of the Ethernet serdes lane.  This driver only supports
      manipulation of the speed; it does not support configuration of the
      common phy.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14dc100b
    • R
      dt-bindings: phy: Armada 38x common phy bindings · 12038271
      Committed by Russell King
      Add the Marvell Armada 38x common phy bindings.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      12038271
    • D
      Merge branch 'smc-next' · f06f095f
      Committed by David S. Miller
      Ursula Braun says:
      
      ====================
      net/smc: patches 2019-02-07
      
      here are patches for SMC:
      * patches 1, 3, and 6 are cleanups without functional change
      * patch 2 postpones closing of internal clcsock
      * patches 4 and 5 improve link group creation locking
      * patch 7 restores AF_SMC as diag_family field
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f06f095f
    • K
      net/smc: original socket family in inet_sock_diag · 232dc8ef
      Committed by Karsten Graul
      Commit ed75986f ("net/smc: ipv6 support for smc_diag.c") changed the
      value of the diag_family field. The idea was to indicate the family of
      the IP address in the inet_diag_sockid field. But the change makes it
      impossible to distinguish an inet_sock_diag response message from SMC
      sock_diag response. This patch restores the original behaviour and sends
      AF_SMC as value of the diag_family field.
      
      Fixes: ed75986f ("net/smc: ipv6 support for smc_diag.c")
      Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      232dc8ef
    • K
      net/smc: move code to clear the conn->lgr field · 8fc002b0
      Committed by Karsten Graul
      The lgr field of an smc_connection is set in smc_conn_create() and
      should be cleared in smc_conn_free() for consistency reasons, so move
      the responsible code.
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8fc002b0
    • H
      net/smc: use client and server LGR pending locks for SMC-R · 72a36a8a
      Committed by Hans Wippel
      If SMC client and server connections are both established at the same
      time, smc_connect_rdma() cannot send a CLC confirm message while
      smc_listen_work() is waiting for one due to lock contention. This can
      result in timeouts in smc_clc_wait_msg() and failed SMC connections.
      
      In case of SMC-R, there are two types of LGRs (client and server LGRs)
      which can be protected by separate locks. So, this patch splits the LGR
      pending lock into two separate locks for client and server to avoid the
      locking issue for SMC-R.
      Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      72a36a8a
    • H
      net/smc: unlock LGR pending lock earlier for SMC-D · 62c7139f
      Committed by Hans Wippel
      If SMC client and server connections are both established at the same
      time, smc_connect_ism() cannot send a CLC confirm message while
      smc_listen_work() is waiting for one due to lock contention. This can
      result in timeouts in smc_clc_wait_msg() and failed SMC connections.
      
      In case of SMC-D, the LGR pending lock is not needed while
      smc_listen_work() is waiting for the CLC confirm message. So, this patch
      releases the lock earlier for SMC-D to avoid the locking issue.
      Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62c7139f
    • U
      net/smc: use smc_curs_copy() for SMC-D · a225d2cd
      Committed by Ursula Braun
      SMC already provides a wrapper for atomic64 calls to be
      architecture independent. Use this wrapper for SMC-D as well.
      Reported-by: Jens Remus <jremus@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a225d2cd
    • U
      net/smc: postpone release of clcsock · b03faa1f
      Committed by Ursula Braun
      According to RFC7609 (http://www.rfc-editor.org/info/rfc7609),
      first the SMC-R connection is shut down and then the normal TCP
      connection FIN processing drives cleanup of the internal TCP connection.
      The unconditional release of the clcsock during active socket closing
      has to be postponed if the peer has not yet signalled socket closing.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b03faa1f
    • U
      s390/net: move pnet constants · 41c80be2
      Committed by Ursula Braun
      There is no need to define these PNETID related constants in
      the pnet.h file, since they are just used locally within pnet.c.
      Just code cleanup, no functional change.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41c80be2
    • P
      net: vxlan: Free a leaked vetoed multicast rdst · fc4aa1ca
      Committed by Petr Machata
      When an rdst is rejected by a driver, the current code removes it from
      the remote list, but neglects to free it. This is triggered by
      tools/testing/selftests/drivers/net/mlxsw/vxlan_fdb_veto.sh and shows as
      the following kmemleak trace:
      
      unreferenced object 0xffff88817fa3d888 (size 96):
        comm "softirq", pid 0, jiffies 4372702718 (age 165.252s)
        hex dump (first 32 bytes):
          02 00 00 00 c6 33 64 03 80 f5 a2 61 81 88 ff ff  .....3d....a....
          06 df 71 ae ff ff ff ff 0c 00 00 00 04 d2 6a 6b  ..q...........jk
        backtrace:
          [<00000000296b27ac>] kmem_cache_alloc_trace+0x1ae/0x370
          [<0000000075c86dc6>] vxlan_fdb_append.part.12+0x62/0x3b0 [vxlan]
          [<00000000e0414b63>] vxlan_fdb_update+0xc61/0x1020 [vxlan]
          [<00000000f330c4bd>] vxlan_fdb_add+0x2e8/0x3d0 [vxlan]
          [<0000000008f81c2c>] rtnl_fdb_add+0x4c2/0xa10
          [<00000000bdc4b270>] rtnetlink_rcv_msg+0x6dd/0x970
          [<000000006701f2ce>] netlink_rcv_skb+0x290/0x410
          [<00000000c08a5487>] rtnetlink_rcv+0x15/0x20
          [<00000000d5f54b1e>] netlink_unicast+0x43f/0x5e0
          [<00000000db4336bb>] netlink_sendmsg+0x789/0xcd0
          [<00000000e1ee26b6>] sock_sendmsg+0xba/0x100
          [<00000000ba409802>] ___sys_sendmsg+0x631/0x960
          [<000000003c332113>] __sys_sendmsg+0xea/0x180
          [<00000000f4139144>] __x64_sys_sendmsg+0x78/0xb0
          [<000000006d1ddc59>] do_syscall_64+0x94/0x410
          [<00000000c8defa9a>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Move vxlan_dst_free() up and schedule a call thereof to plug this leak.
      
      Fixes: 61f46fe8 ("vxlan: Allow vetoing of FDB notifications")
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc4aa1ca
    • D
      Merge branch 'devlink-health' · 0739d24d
      Committed by David S. Miller
      Eran Ben Elisha says:
      
      ====================
      Devlink health reporting and recovery system
      
      The health mechanism is targeted for Real Time Alerting, in order to know when
      something bad has happened to a PCI device
      - Provide alert debug information
      - Self healing
      - If problem needs vendor support, provide a way to gather all needed debugging
        information.
      
      The main idea is to unify and centralize driver health reports in the
      generic devlink instance and allow the user to set different
      attributes of the health reporting and recovery procedures.
      
      The devlink health reporter:
      Device driver creates a "health reporter" per each error/health type.
      Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
      or unknown (driver specific).
      For each registered health reporter a driver can issue error/health reports
      asynchronously. All health reports handling is done by devlink.
      Device driver can provide specific callbacks for each "health reporter", e.g.
       - Recovery procedures
       - Diagnostics and object dump procedures
       - OOB initial attributes
      Different parts of the driver can register different types of health reporters
      with different handlers.
      
      Once an error is reported, devlink health will do the following actions:
        * A log is sent to the kernel trace events buffer
        * Health status and statistics are being updated for the reporter instance
        * Object dump is being taken and saved at the reporter instance (as long as
          there is no other dump which is already stored)
        * Auto recovery attempt is being done. Depends on:
          - Auto-recovery configuration
          - Grace period vs. time passed since last recover
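The sequence of actions listed above can be modelled in a few lines of userspace C. This is only a sketch of the described policy, not the devlink implementation: `struct reporter_model` and `report_error()` are hypothetical, time is plain milliseconds, and the rule "attempt recovery only if the grace period has elapsed since the last recovery" is an assumed reading of the last bullet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-reporter state for this model. */
struct reporter_model {
	bool     auto_recover;      /* auto-recovery configuration */
	uint64_t grace_period_ms;   /* minimum spacing between recoveries */
	uint64_t last_recovery_ms;  /* valid only if has_recovered is set */
	bool     has_recovered;
	bool     dump_stored;       /* a single dump is kept per reporter */
	unsigned error_count;
	unsigned recovery_count;
};

/* Handle one error report; return true if a recovery was attempted. */
static bool report_error(struct reporter_model *r, uint64_t now_ms)
{
	r->error_count++;               /* update statistics */

	if (!r->dump_stored)
		r->dump_stored = true;  /* keep the first dump only */

	if (!r->auto_recover)
		return false;
	if (r->has_recovered &&
	    now_ms - r->last_recovery_ms < r->grace_period_ms)
		return false;           /* still inside the grace period */

	r->last_recovery_ms = now_ms;
	r->has_recovered = true;
	r->recovery_count++;
	return true;
}

static void health_example(void)
{
	struct reporter_model r = { .auto_recover = true,
				    .grace_period_ms = 500 };

	assert(report_error(&r, 1000));   /* first error: recover */
	assert(!report_error(&r, 1200));  /* 200 ms later: suppressed */
	assert(report_error(&r, 1600));   /* 600 ms after last recovery */
	assert(r.error_count == 3 && r.recovery_count == 2);
}
```

Counting every report while rate-limiting recoveries keeps the statistics accurate even when a device fails repeatedly inside the grace period.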
      
      The user interface:
      User can access/change each reporter attributes and driver specific callbacks
      via devlink, e.g per error type (per health reporter)
       - Configure reporter's generic attributes (like: Disable/enable auto recovery)
       - Invoke recovery procedure
       - Run diagnostics
       - Object dump
      
      The devlink health interface (via netlink):
      DEVLINK_CMD_HEALTH_REPORTER_GET
        Retrieves status and configuration info per DEV and reporter.
      DEVLINK_CMD_HEALTH_REPORTER_SET
        Allows reporter-related configuration setting.
      DEVLINK_CMD_HEALTH_REPORTER_RECOVER
        Triggers a reporter's recovery procedure.
      DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
        Retrieves diagnostics data from a reporter on a device.
      DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
        Retrieves the last stored dump. Devlink health
        saves a single dump. If a dump is not already stored by devlink
        for this reporter, devlink generates a new dump.
        Dump output is defined by the reporter.
      DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
        Clears the last saved dump file for the specified reporter.
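
The single-dump behaviour described for DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET and DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR can be illustrated with a small model. The `DumpStore` class and its method names are hypothetical; only the semantics (one stored dump, generated on demand if absent, dropped on clear) come from the text above.

```python
from typing import Callable, Optional


class DumpStore:
    """Hypothetical model of the one-dump-per-reporter storage."""

    def __init__(self, generate: Callable[[], dict]):
        self._generate = generate        # reporter-defined dump callback
        self._saved: Optional[dict] = None

    def dump_get(self) -> dict:
        # Return the stored dump; generate and store one if none exists.
        if self._saved is None:
            self._saved = self._generate()
        return self._saved

    def dump_clear(self) -> None:
        # Drop the stored dump so the next get generates a fresh one.
        self._saved = None
```

Under this model, repeated gets return the same dump until a clear is issued.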
      
                                                     netlink
                                            +--------------------------+
                                            |                          |
                                            |            +             |
                                            |            |             |
                                            +--------------------------+
                                                         |request for ops
                                                         |(diagnose,
       mlx5_core                             devlink     |recover,
                                                         |dump)
      +--------+                            +--------------------------+
      |        |                            |    reporter|             |
      |        |                            |  +---------v----------+  |
      |        |   ops execution            |  |                    |  |
      |     <----------------------------------+                    |  |
      |        |                            |  |                    |  |
      |        |                            |  + ^------------------+  |
      |        |                            |    | request for ops     |
      |        |                            |    | (recover, dump)     |
      |        |                            |    |                     |
      |        |                            |  +-+------------------+  |
      |        |     health report          |  | health handler     |  |
      |        +------------------------------->                    |  |
      |        |                            |  +--------------------+  |
      |        |     health reporter create |                          |
      |        +---------------------------->                          |
      +--------+                            +--------------------------+
      
      In this patchset, the mlx5e TX reporter is implemented.
      
      Cmdline format:
          devlink health show [DEV reporter REPORTER_NAME]
          devlink health recover DEV reporter REPORTER_NAME
          devlink health diagnose DEV reporter REPORTER_NAME
          devlink health dump show DEV reporter REPORTER_NAME
          devlink health dump clear DEV reporter REPORTER_NAME
          devlink health set DEV reporter REPORTER_NAME NAME VALUE
      
      Cmdline examples:
      $devlink health show
      pci/0000:00:09.0:
        name tx
          state healthy #err 1 #recover 0 last_dump_ts N/A
          parameters:
            grace_period 500 auto_recover false
      
      $devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
      {
          "SQs": [ {
                  "sqn": 138,
                  "HW state": 1,
                  "stopped": false
              },{
                  "sqn": 142,
                  "HW state": 1,
                  "stopped": false
              } ]
      }
      
      $devlink health diagnose pci/0000:00:09.0 reporter tx
      SQs:
        sqn: 138 HW state: 1 stopped: false
        sqn: 142 HW state: 1 stopped: false
      
      $devlink health recover pci/0000:00:09.0 reporter tx
      
      $devlink health set pci/0000:00:09.0 reporter tx grace_period 3500
      
      $devlink health set pci/0000:00:09.0 reporter tx auto_recover false
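
The JSON form of the diagnose output (`-j -p`) is convenient for scripting. The sketch below parses the sample output shown above and lists the SQs that report `"stopped": true`. The `stopped_sqs` helper is a hypothetical name, and the key names are taken from this sample only; other reporters define their own output.

```python
import json

# Sample copied from the diagnose example above.
SAMPLE = """
{
    "SQs": [ {
            "sqn": 138,
            "HW state": 1,
            "stopped": false
        },{
            "sqn": 142,
            "HW state": 1,
            "stopped": false
        } ]
}
"""


def stopped_sqs(diagnose_json: str) -> list:
    """Return the sqn of every SQ whose "stopped" field is true."""
    data = json.loads(diagnose_json)
    return [sq["sqn"] for sq in data.get("SQs", []) if sq.get("stopped")]


print(stopped_sqs(SAMPLE))  # prints [] since no SQ is stopped in the sample
```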
      
      Changelog:
      v4:
      - Rebase on latest net-next
      - Remove trace_devlink_health signature exposure in case CONFIG_NET_DEVLINK is
        not defined as it shall only be used from devlink.
      
      v3:
      - Redesign of devlink <-> driver fmsg API
      - Various bug fixes
      
      v2:
      - Remove FW* reporters to decrease the amount of patches in the patchset
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0739d24d
    • A
      devlink: Add Documentation/networking/devlink-health.txt · db2ab7a0
      Aya Levin committed
      This patch adds a new file documenting the devlink health
      mechanism.
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net/mlx5e: Add tx timeout support for mlx5e tx reporter · 7d91126b
      Eran Ben Elisha committed
      With this patch, the ndo_tx_timeout callback is redirected to the tx
      reporter in order to detect a tx timeout error and report it to
      devlink health. (The watchdog detects tx timeouts, but the driver verifies
      that the issue still exists before launching any recovery method.)
      
      In addition, recovery from a tx timeout caused by a lost interrupt is added
      to the tx reporter recover method. This recovery
      is not a new feature in the driver; this patch reorganizes the
      functionality and moves it to the tx reporter recovery flow.
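
The verify-before-report step described above can be sketched as follows. This is an illustration only: `tx_timeout_work`, the queue dictionaries and the `report` callback are hypothetical, and the sketch assumes that "the issue still exists" means the queue is still stopped when the driver examines it.

```python
from typing import Callable, Iterable, List


def tx_timeout_work(queues: Iterable[dict],
                    report: Callable[[str], None]) -> List[int]:
    """Report a tx timeout for each queue that is still stopped."""
    reported = []
    for q in queues:
        if not q["stopped"]:
            continue  # the queue resumed on its own; nothing to report
        report(f"TX timeout on queue: {q['index']}, SQ: {q['sqn']:#x}")
        reported.append(q["index"])
    return reported
```

Queues that have resumed by the time the handler runs produce no report, so no recovery is launched for them.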
      
      tx timeout example:
      (with auto_recover set to false; if it is set to true, the manual recover and
      diagnose steps are irrelevant)
      
      $cat /sys/kernel/debug/tracing/trace
      ...
      devlink_health_report: bus_name=pci dev_name=0000:00:09.0
      driver_name=mlx5_core reporter_name=tx: TX timeout on queue: 0, SQ: 0x8a,
      CQ: 0x35, SQ Cons: 0x2 SQ Prod: 0x2, usecs since last trans: 14912000
      
      $devlink health show
      pci/0000:00:09.0:
        name tx
          state healthy #err 1 #recover 0 last_dump_ts N/A
          parameters:
            grace_period 500 auto_recover false
      
      $devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
      {
          "SQs": [ {
                  "sqn": 138,
                  "HW state": 1,
                  "stopped": true
              },{
                  "sqn": 142,
                  "HW state": 1,
                  "stopped": false
              } ]
      }
      
      $devlink health diagnose pci/0000:00:09.0 reporter tx
      SQs:
        sqn: 138 HW state: 1 stopped: true
        sqn: 142 HW state: 1 stopped: false
      
      $devlink health recover pci/0000:00:09.0 reporter tx
      $devlink health show
      pci/0000:00:09.0:
        name tx
          state healthy #err 1 #recover 1 last_dump_ts N/A
          parameters:
            grace_period 500 auto_recover false
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net/mlx5e: Add tx reporter support · de8650a8
      Eran Ben Elisha committed
      Add the mlx5e tx reporter to the devlink health reporters. This reporter is
      responsible for diagnosing, reporting and recovering from tx errors.
      This patch declares the TX reporter operations and creates the reporter using the
      devlink health API. Currently, this reporter supports reporting and
      recovering from send error CQEs only. In addition, it adds diagnostic
      information for the open SQs.
      
      For a local SQ recovery (due to a driver error report), if the SQ recovery
      fails, the recover operation is considered a failure.
      For a full tx recovery, an attempt is made to close and reopen the
      channels. If this succeeds, the recovery is considered
      successful.
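
The two recovery paths described above can be summarised in a short sketch. The function and callback names are hypothetical, and the dispatch rule (an error context selects the local SQ path, its absence selects the full tx path) is an assumption made for illustration.

```python
from typing import Callable, Optional


def tx_recover(err_sqn: Optional[int],
               recover_sq: Callable[[int], bool],
               reopen_channels: Callable[[], bool]) -> bool:
    """Return True when the recover operation counts as successful."""
    if err_sqn is not None:
        # Local SQ recovery: the result of the SQ recovery is the result
        # of the whole operation.
        return recover_sq(err_sqn)
    # Full tx recovery: close and reopen the channels.
    return reopen_channels()
```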
      
      The SQ recovery from error CQE flow is not a new feature in the driver;
      this patch reorganizes the functions and adapts them to the devlink
      health API. For this purpose, code is moved from en_main.c to a new file
      named reporter_tx.c.
      
      Diagnose output:
      $devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
      {
          "SQs": [ {
                  "sqn": 138,
                  "HW state": 1,
                  "stopped": false
              },{
                  "sqn": 142,
                  "HW state": 1,
                  "stopped": false
              } ]
      }
      
      $devlink health diagnose pci/0000:00:09.0 reporter tx
      SQs:
        sqn: 138 HW state: 1 stopped: false
        sqn: 142 HW state: 1 stopped: false
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>