1. 31 Mar, 2020 3 commits
  2. 30 Mar, 2020 5 commits
  3. 29 Mar, 2020 1 commit
  4. 28 Mar, 2020 3 commits
    • bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id · 0f09abd1
      Daniel Borkmann committed
      Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
      recvmsg() and bind-related hooks in order to retrieve the cgroup v2
      context, which can then be used as part of the key for BPF map lookups,
      for example. Given that these hooks operate in process context, 'current'
      is always valid and points to the application performing the mentioned
      syscalls, provided it is subject to a v2 cgroup. Also, with the same
      motivation as commit 77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id
      helper"), enable retrieval of the ancestor from current so that the cgroup
      id can be used for policy lookups, which can then forbid connect() /
      bind(), for example.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
      0f09abd1
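The policy lookup described in this commit can be modelled in plain user-space C. This is only a sketch of the idea, not BPF program code: `struct cgroup_path`, `ancestor_cgroup_id()`, `struct policy_entry` and `connect_denied()` are hypothetical names, and the array scan stands in for a BPF map lookup. It assumes the level convention used by the kernel's ancestor helpers, where the root cgroup is level 0 and a level deeper than the task's own cgroup yields 0.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 8

/* Toy model of a task's position in the cgroup v2 hierarchy:
 * ids[0] is the root cgroup, ids[level] is the task's own cgroup. */
struct cgroup_path {
    uint64_t ids[MAX_DEPTH];
    int level;
};

/* Model of an ancestor lookup: returns the cgroup id at ancestor_level,
 * or 0 when that level does not exist on the task's path. */
static uint64_t ancestor_cgroup_id(const struct cgroup_path *cg, int ancestor_level)
{
    if (ancestor_level < 0 || ancestor_level >= MAX_DEPTH ||
        ancestor_level > cg->level)
        return 0;
    return cg->ids[ancestor_level];
}

/* Hypothetical policy table keyed by cgroup id (stand-in for a BPF map). */
struct policy_entry {
    uint64_t cgroup_id;
    int deny_connect;
};

/* Returns 1 if connect() should be refused for this cgroup id, else 0.
 * Ids without an entry are allowed by default. */
static int connect_denied(const struct policy_entry *tbl, size_t n, uint64_t cgroup_id)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].cgroup_id == cgroup_id)
            return tbl[i].deny_connect;
    }
    return 0;
}
```

Keying the policy on an ancestor id rather than the task's own id lets one entry cover every cgroup below that ancestor, which is the motivation the commit message borrows from the skb-based ancestor helper.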
    • bpf: Allow to retrieve cgroup v1 classid from v2 hooks · 5a52ae4e
      Daniel Borkmann committed
      Today, Kubernetes still operates on cgroups v1. However, it is possible
      to retrieve the task's classid based on 'current' from the connect(),
      sendmsg(), recvmsg() and bind-related hooks. This helps orchestrators
      which attach to the root cgroup v2 hook in a mixed environment, such as
      Cilium, to correlate certain pod traffic and to use the classid as part
      of the key for BPF map lookups.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
      5a52ae4e
    • bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
      Daniel Borkmann committed
      In Cilium we're mainly using BPF cgroup hooks today in order to implement
      kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
      ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
      between Cilium managed nodes. While this works in its current shape and avoids
      packet-level NAT for inter Cilium managed node traffic, there is one major
      limitation we're facing today, that is, lack of netns awareness.
      
      In Kubernetes, the concept of Pods (which hold one or multiple containers)
      has been built around network namespaces. While we can use the global scope
      of attaching to root BPF cgroup hooks to our advantage (e.g. for exposing
      NodePort ports on loopback addresses), we also need to differentiate
      between the initial network namespace and non-initial ones. For example,
      ExternalIP services mandate that non-local service IPs are not translated
      from the host (initial) network namespace. Right now, we have an ugly
      workaround in place where non-local service IPs for ExternalIP services are
      not translated from the connect() and friends BPF hooks but instead via less
      efficient packet-level NAT on the veth tc ingress hook for Pod traffic.
      
      On top of determining whether we're in the initial or a non-initial network
      namespace, we also need a socket-cookie-like mechanism scoped to network
      namespaces. Socket cookies have the nice property that they can be combined
      as part of the key structure, e.g. for BPF LRU maps, without having to worry
      that the cookie could be recycled. We are planning to use this for our
      sessionAffinity implementation for services. Therefore, add a new
      bpf_get_netns_cookie() helper which resolves both use cases at once:
      bpf_get_netns_cookie(NULL) provides the cookie for the initial network
      namespace, while passing the context instead of NULL provides the cookie of
      the application's network namespace. We're using a hole, so there is no size
      increase; the assignment happens only once. This allows for a comparison
      against the initial namespace as well as regular cookie usage as we have
      today with socket cookies. We could later enable this helper for other
      program types as the need arises.
      
        (*) Both externalTrafficPolicy={Local|Cluster} types
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
      f318903c
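The two use cases named in this commit, detecting the initial namespace and building a recycle-safe key, can be illustrated with a small user-space model. This is a sketch under assumptions, not BPF program code: in a real cgroup hook the two cookie values would come from bpf_get_netns_cookie(NULL) and bpf_get_netns_cookie(ctx), while `struct netns_view`, `may_translate_external_ip()`, `struct affinity_key` and the `service_id` field are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two helper results:
 *   init_cookie ~ bpf_get_netns_cookie(NULL)  (initial netns)
 *   app_cookie  ~ bpf_get_netns_cookie(ctx)   (application's netns) */
struct netns_view {
    uint64_t init_cookie;
    uint64_t app_cookie;
};

/* The application runs in the initial netns iff both cookies match. */
static bool in_initial_netns(const struct netns_view *v)
{
    return v->app_cookie == v->init_cookie;
}

/* ExternalIP rule from the commit message: a non-local service IP must
 * not be translated when the caller sits in the host (initial) netns. */
static bool may_translate_external_ip(const struct netns_view *v,
                                      bool service_ip_is_local)
{
    if (in_initial_netns(v) && !service_ip_is_local)
        return false;
    return true;
}

/* Hypothetical session-affinity key: the netns cookie identifies the
 * client namespace without the risk of the value being recycled. */
struct affinity_key {
    uint64_t netns_cookie;
    uint32_t service_id;
    uint32_t pad; /* explicit padding keeps the key fully initialized */
};

static struct affinity_key make_affinity_key(const struct netns_view *v,
                                             uint32_t service_id)
{
    struct affinity_key k = {
        .netns_cookie = v->app_cookie,
        .service_id = service_id,
        .pad = 0,
    };
    return k;
}
```

With such a check available in the connect() hook, the packet-level NAT fallback described above would no longer be needed just to tell host traffic from Pod traffic.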
  5. 20 Mar, 2020 1 commit
  6. 19 Mar, 2020 1 commit
  7. 16 Mar, 2020 4 commits
  8. 15 Mar, 2020 3 commits
    • net: sched: RED: Introduce an ECN nodrop mode · 0a7fad23
      Petr Machata committed
      Currently, when the RED Qdisc is configured to enable ECN, the RED algorithm
      is used to decide whether a certain SKB should be marked. If that SKB is
      not ECN-capable, it is early-dropped.
      
      It is also possible to keep all traffic in the queue, and just mark the
      ECN-capable subset of it, as appropriate under the RED algorithm. Some
      switches support this mode, and some installations make use of it.
      
      To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is
      configured with this flag, non-ECT traffic is enqueued instead of being
      early-dropped.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a7fad23
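The behavioural difference introduced by TC_RED_NODROP can be written down as a small decision function. The sketch below is a simplified user-space model, not the kernel's enqueue path: it covers only the early (probabilistic) mark decision, and `red_early_action()`, `enum red_verdict` and the `SK_RED_*` constants with their values are hypothetical names for illustration.

```c
#include <stdbool.h>

/* Illustrative flag bits for this sketch only. */
#define SK_RED_ECN    0x1u
#define SK_RED_NODROP 0x8u

enum red_verdict {
    RED_ENQUEUE, /* packet is queued untouched */
    RED_MARK,    /* packet is ECN-marked and queued */
    RED_DROP     /* packet is early-dropped */
};

/* red_says_mark: the RED algorithm selected this packet for marking.
 * ect:           the packet is ECN-capable. */
static enum red_verdict red_early_action(unsigned int flags,
                                         bool red_says_mark, bool ect)
{
    if (!red_says_mark)
        return RED_ENQUEUE;

    if ((flags & SK_RED_ECN) && ect)
        return RED_MARK;

    /* New mode: with ECN and nodrop both set, non-ECT traffic stays
     * in the queue instead of being early-dropped. */
    if ((flags & SK_RED_ECN) && (flags & SK_RED_NODROP))
        return RED_ENQUEUE;

    return RED_DROP;
}
```

In this model the nodrop bit only has an effect together with the ECN bit, which matches the commit's framing of nodrop as a variation of the ECN-enabled mode.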
    • net: sched: Allow extending set of supported RED flags · 14bc175d
      Petr Machata committed
      The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool
      of global RED flags. These are passed in tc_red_qopt.flags. However, none of
      these qdiscs validates the flags field; each just copies it wholesale to
      internal structures and later dumps it back. (An exception is GRED, which
      does validate for VQs -- however not for the main setup.)
      
      A broken userspace can therefore configure a qdisc with arbitrary
      unsupported flags, and later expect to see the flags on qdisc dump. The
      current ABI therefore allows storing several bits of custom data in
      qdisc instances of the types mentioned above. How many bits depends on
      which flags are meaningful for the qdisc in question. E.g. SFQ recognizes
      the flags ECN and HARDDROP, and the rest is not interpreted.
      
      If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it,
      and at the same time it needs to retain the possibility to store 6 bits of
      uninterpreted data. Likewise RED, which adds a new flag later in this
      patchset.
      
      To that end, this patch adds a new function, red_get_flags(), to split the
      passed flags of RED-like qdiscs to flags and user bits, and
      red_validate_flags() to validate the resulting configuration. It further
      adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14bc175d
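The split between interpreted flags and uninterpreted user bits can be sketched as follows. This is a simplified model and deliberately not the signature of the kernel's red_get_flags() or red_validate_flags(): `split_red_flags()`, `validate_red_flags()`, `struct split_flags` and the `SK_FLAG_*` values are hypothetical, and the netlink attribute is reduced to a plain integer argument.

```c
#include <stdint.h>

/* Illustrative flag bits for this sketch only. */
#define SK_FLAG_ECN        0x1u
#define SK_FLAG_HARDDROP   0x2u
#define SK_FLAG_ADAPTATIVE 0x4u

struct split_flags {
    uint32_t flags;    /* bits the qdisc actually interprets */
    uint8_t  userbits; /* legacy bits stored and dumped back untouched */
};

/* qopt_flags:    the legacy 8-bit flags field as sent by userspace.
 * historic_mask: bits this qdisc type has always interpreted there.
 * attr_flags:    flags passed through the new dedicated attribute. */
static struct split_flags split_red_flags(uint8_t qopt_flags,
                                          uint8_t historic_mask,
                                          uint32_t attr_flags)
{
    struct split_flags out;

    out.flags = (uint32_t)(qopt_flags & historic_mask) | attr_flags;
    out.userbits = (uint8_t)(qopt_flags & (uint8_t)~historic_mask);
    return out;
}

/* Returns 0 if every interpreted flag is supported, -1 otherwise. */
static int validate_red_flags(uint32_t flags, uint32_t supported)
{
    return (flags & ~supported) ? -1 : 0;
}
```

For an SFQ-like qdisc whose historic mask covers only ECN and HARDDROP, the remaining 6 bits of the legacy field land in `userbits`, which preserves the storage behaviour the commit message says must be retained, while any newly introduced flag has to arrive through the separate attribute.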
    • net: phy: Add XLGMII interface define · 58b05e58
      Jose Abreu committed
      Add a define for XLGMII interface.
      Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58b05e58
  9. 14 Mar, 2020 11 commits
  10. 13 Mar, 2020 8 commits