1. 02 7月, 2016 2 次提交
    • M
      cgroup: bpf: Add bpf_skb_in_cgroup_proto · 4a482f34
      Martin KaFai Lau 提交于
      Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
      belongs to a descendant of a cgroup2.  It is similar to the
      feature added in netfilter:
      commit c38c4597 ("netfilter: implement xt_cgroup cgroup2 path match")
      
      The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
      which will be used by the bpf_skb_in_cgroup.
      
      Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY
      and bpf_skb_in_cgroup() are always used together.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a482f34
    • D
      bpf: refactor bpf_prog_get and type check into helper · 113214be
      Daniel Borkmann 提交于
      Since bpf_prog_get() and program type check is used in a couple of places,
      refactor this into a small helper function that we can make use of. Since
      the non RO prog->aux part is not used in performance critical paths and a
      program destruction via RCU is rather very unlikley when doing the put, we
      shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
      check, but actually not taking the ref at all (due to being in fdget() /
      fdput() section of the bpf fd) is even cleaner and makes the diff smaller
      as well, so just go for that. Callsites are changed to make use of the new
      helper where possible.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      113214be
  2. 01 7月, 2016 7 次提交
  3. 30 6月, 2016 7 次提交
    • M
      fib_rules: Added NLM_F_EXCL support to fib_nl_newrule · 153380ec
      Mateusz Bajorski 提交于
      When adding rule with NLM_F_EXCL flag then check if the same rule exist.
      If yes then exit with -EEXIST.
      
      This is already implemented in iproute2:
              if (cmd == RTM_NEWRULE) {
                      req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
                      req.r.rtm_type = RTN_UNICAST;
              }
      
      Tested ipv4 and ipv6 with net-next linux on qemu x86
      
      expected behavior after patch:
      localhost ~ # ip rule
      0:    from all lookup local
      32766:    from all lookup main
      32767:    from all lookup default
      localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
      localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
      RTNETLINK answers: File exists
      localhost ~ # ip rule
      0:    from all lookup local
      1005:    from 10.46.177.97 lookup 104
      32766:    from all lookup main
      32767:    from all lookup default
      
      There was already topic regarding this but I don't see any changes
      merged and problem still occurs.
      https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomainSigned-off-by: NMateusz Bajorski <mateusz.bajorski@nokia.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      153380ec
    • A
      tcp: add an ability to dump and restore window parameters · b1ed4c4f
      Andrey Vagin 提交于
      We found that sometimes a restored tcp socket doesn't work.
      
      A reason of this bug is incorrect window parameters and in this case
      tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
      other side drops packets with this seq, because seq is less than
      tp->rcv_nxt ( tcp_sequence() ).
      
      Data from a send queue is sent only if there is enough space in a
      window, so when we restore unacked data, we need to expand a window to
      fit this data.
      
      This was in a first version of this patch:
      "tcp: extend window to fit all restored unacked data in a send queue"
      
      Then Alexey recommended me to restore window parameters instead of
      adjusted them according with data in a sent queue. This sounds resonable.
      
      rcv_wnd has to be restored, because it was reported to another side
      and the offered window is never shrunk.
      One of reasons why we need to restore snd_wnd was described above.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1ed4c4f
    • N
      net: bridge: add support for IGMP/MLD stats and export them via netlink · 1080ab95
      Nikolay Aleksandrov 提交于
      This patch adds stats support for the currently used IGMP/MLD types by the
      bridge. The stats are per-port (plus one stat per-bridge) and per-direction
      (RX/TX). The stats are exported via netlink via the new linkxstats API
      (RTM_GETSTATS). In order to minimize the performance impact, a new option
      is used to enable/disable the stats - multicast_stats_enabled, similar to
      the recent vlan stats. Also in order to avoid multiple IGMP/MLD type
      lookups and checks, we make use of the current "igmp" member of the bridge
      private skb->cb region to record the type on Rx (both host-generated and
      external packets pass by multicast_rcv()). We can do that since the igmp
      member was used as a boolean and all the valid IGMP/MLD types are positive
      values. The normal bridge fast-path is not affected at all, the only
      affected paths are the flooding ones and since we make use of the IGMP/MLD
      type, we can quickly determine if the packet should be counted using
      cache-hot data (cb's igmp member). We add counters for:
      * IGMP Queries
      * IGMP Leaves
      * IGMP v1/v2/v3 reports
      
      * MLD Queries
      * MLD Leaves
      * MLD v1/v2 reports
      
      These are invaluable when monitoring or debugging complex multicast setups
      with bridges.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1080ab95
    • N
      net: rtnetlink: add support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute · 80e73cc5
      Nikolay Aleksandrov 提交于
      This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute
      which allows to export per-slave statistics if the master device supports
      the linkxstats callback. The attribute is passed down to the linkxstats
      callback and it is up to the callback user to use it (an example has been
      added to the only current user - the bridge). This allows us to query only
      specific slaves of master devices like bridge ports and export only what
      we're interested in instead of having to dump all ports and searching only
      for a single one. This will be used to export per-port IGMP/MLD stats and
      also per-port vlan stats in the future, possibly other statistics as well.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80e73cc5
    • D
      bpf: add bpf_skb_change_type helper · d2485c42
      Daniel Borkmann 提交于
      This work adds a helper for changing skb->pkt_type in a controlled way.
      We only allow a subset of possible values and can extend that in future
      should other use cases come up. Doing this as a helper has the advantage
      that errors can be handeled gracefully and thus helper kept extensible.
      
      It's a write counterpart to pkt_type member we can already read from
      struct __sk_buff context. Major use case is to change incoming skbs to
      PACKET_HOST in a programmatic way instead of having to recirculate via
      redirect(..., BPF_F_INGRESS), for example.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2485c42
    • D
      bpf: add bpf_skb_change_proto helper · 6578171a
      Daniel Borkmann 提交于
      This patch adds a minimal helper for doing the groundwork of changing
      the skb->protocol in a controlled way. Currently supported is v4 to
      v6 and vice versa transitions, which allows f.e. for a minimal, static
      nat64 implementation where applications in containers that still
      require IPv4 can be transparently operated in an IPv6-only environment.
      For example, host facing veth of the container can transparently do
      the transitions in a programmatic way with the help of clsact qdisc
      and cls_bpf.
      
      Idea is to separate concerns for keeping complexity of the helper
      lower, which means that the programs utilize bpf_skb_change_proto(),
      bpf_skb_store_bytes() and bpf_lX_csum_replace() to get the job done,
      instead of doing everything in a single helper (and thus partially
      duplicating helper functionality). Also, bpf_skb_change_proto()
      shouldn't need to deal with raw packet data as this is done by other
      helpers.
      
      bpf_skb_proto_6_to_4() and bpf_skb_proto_4_to_6() unclone the skb to
      operate on a private one, push or pop additionally required header
      space and migrate the gso/gro meta data from the shared info. We do
      mark the gso type as dodgy so that headers are checked and segs
      recalculated by the gso/gro engine. The gso_size target is adapted
      as well. The flags argument added is currently reserved and can be
      used for future extensions.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6578171a
    • D
      bpf: don't use raw processor id in generic helper · 80b48c44
      Daniel Borkmann 提交于
      Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
      instead of the raw variant. This allows for preemption checks when we
      have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only
      need to keep the raw variant for socket filters, but we can reuse the
      helper that is already there from cBPF side.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80b48c44
  4. 29 6月, 2016 14 次提交
  5. 28 6月, 2016 5 次提交
  6. 27 6月, 2016 3 次提交
  7. 26 6月, 2016 2 次提交
    • E
      net_sched: generalize bulk dequeue · 4d202a0d
      Eric Dumazet 提交于
      When qdisc bulk dequeue was added in linux-3.18 (commit
      5772e9a3 "qdisc: bulk dequeue support for qdiscs
      with TCQ_F_ONETXQUEUE"), it was constrained to some
      specific qdiscs.
      
      With some extra care, we can extend this to all qdiscs,
      so that typical traffic shaping solutions can benefit from
      small batches (8 packets in this patch).
      
      For example, HTB is often used on some multi queue device.
      And bonding/team are multi queue devices...
      
      Idea is to bulk-dequeue packets mapping to the same transmit queue.
      
      This brings between 35 and 80 % performance increase in HTB setup
      under pressure on a bonding setup :
      
      1) NUMA node contention :   610,000 pps -> 1,110,000 pps
      2) No node contention   : 1,380,000 pps -> 1,930,000 pps
      
      Now we should work to add batches on the enqueue() side ;)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d202a0d
    • E
      net_sched: sch_htb: export class backlog in dumps · 338ed9b4
      Eric Dumazet 提交于
      We already get child qdisc qlen, we also can get its backlog
      so that class dumps can report it.
      
      Also replace qstats by a single drop counter, but move it in
      a separate cache line so that drops do not dirty useful cache lines.
      
      Tested:
      
      $ tc -s cl sh dev eth0
      class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b
       Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0)
       rate 1001Mbit 517543pps backlog 120758b 499p requeues 0
       lended: 9021770 borrowed: 0 giants: 0
       tokens: 9 ctokens: 9
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      338ed9b4