1. 21 Jul 2016, 1 commit
  2. 20 Jul 2016, 3 commits
  3. 16 Jul 2016, 1 commit
    • bpf: avoid stack copy and use skb ctx for event output · 555c8a86
      Committed by Daniel Borkmann
      This work addresses a couple of issues bpf_skb_event_output()
      helper currently has: i) We need two copies instead of just a
      single one for the skb data when it should be part of a sample.
      The data can be non-linear and thus needs to be extracted via
      bpf_skb_load_bytes() helper first, and then copied once again
      into the ring buffer slot. ii) Since bpf_skb_load_bytes()
      currently needs to be used first, the helper needs to see a
      constant size on the passed stack buffer to make sure the BPF
      verifier can do sanity checks on it at verification time.
      Thus, just passing skb->len (or any other non-constant value)
      wouldn't work, but changing bpf_skb_load_bytes() is also not
      the proper solution, since the two copies are generally still
      needed. iii) bpf_skb_load_bytes() is just for rather small
      buffers like headers, since they need to sit on the limited
      BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
      this work improves the bpf_skb_event_output() helper to address
      all 3 at once.
      
      We can make use of the passed in skb context that we have in
      the helper anyway, and use some of the reserved flag bits as
      a length argument. The helper will use the new __output_custom()
      facility from perf side with bpf_skb_copy() as callback helper
      to walk and extract the data. It will pass the data for setup
      to bpf_event_output(), which generates and pushes the raw record
      with an additional frag part. The linear data used in the first
      frag of the record serves as programmatically defined metadata
      passed along with the appended sample. A hedged usage sketch
      follows this entry.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      555c8a86
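      A hedged usage sketch (not from the commit): a tc/cls_bpf program that
      appends the full skb payload to a perf event sample by encoding the
      length in the upper 32 bits of the flags argument (BPF_F_CTXLEN_MASK).
      The map name, section name, meta struct and present-day libbpf-style
      map definition are illustrative assumptions.

          #include <linux/bpf.h>
          #include <bpf/bpf_helpers.h>

          /* Hypothetical perf event array map for the samples. */
          struct {
              __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
              __uint(key_size, sizeof(int));
              __uint(value_size, sizeof(__u32));
          } events SEC(".maps");

          /* Linear "meta data" carried in the first frag of the record. */
          struct event_meta {
              __u32 ifindex;
              __u32 pkt_len;
          };

          SEC("tc")
          int sample_skb(struct __sk_buff *skb)
          {
              struct event_meta meta = {
                  .ifindex = skb->ifindex,
                  .pkt_len = skb->len,
              };
              /* Upper 32 bits of flags: number of skb bytes to append. */
              __u64 flags = BPF_F_CURRENT_CPU | ((__u64)skb->len << 32);

              bpf_perf_event_output(skb, &events, flags, &meta, sizeof(meta));
              return 0; /* TC_ACT_OK */
          }

          char _license[] SEC("license") = "GPL";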
  4. 13 Jul 2016, 1 commit
  5. 10 Jul 2016, 1 commit
  6. 06 Jul 2016, 3 commits
  7. 05 Jul 2016, 3 commits
  8. 03 Jul 2016, 1 commit
  9. 02 Jul 2016, 4 commits
  10. 01 Jul 2016, 1 commit
  11. 30 Jun 2016, 5 commits
    • fib_rules: Added NLM_F_EXCL support to fib_nl_newrule · 153380ec
      Committed by Mateusz Bajorski
      When adding a rule with the NLM_F_EXCL flag set, check whether the
      same rule already exists. If it does, return -EEXIST.
      
      This is already implemented in iproute2:
              if (cmd == RTM_NEWRULE) {
                      req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
                      req.r.rtm_type = RTN_UNICAST;
              }
      
      Tested ipv4 and ipv6 with net-next linux on qemu x86
      
      expected behavior after patch:
      localhost ~ # ip rule
      0:    from all lookup local
      32766:    from all lookup main
      32767:    from all lookup default
      localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
      localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
      RTNETLINK answers: File exists
      localhost ~ # ip rule
      0:    from all lookup local
      1005:    from 10.46.177.97 lookup 104
      32766:    from all lookup main
      32767:    from all lookup default
      
      There was already a thread regarding this, but I don't see any changes
      merged and the problem still occurs:
      https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain
      Signed-off-by: Mateusz Bajorski <mateusz.bajorski@nokia.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      153380ec
    • net: rtnetlink: add support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute · 80e73cc5
      Committed by Nikolay Aleksandrov
      This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute,
      which allows per-slave statistics to be exported if the master device
      supports the linkxstats callback. The attribute is passed down to the
      linkxstats callback, and it is up to the callback user to use it (an
      example has been added to the only current user - the bridge). This
      allows us to query only specific slaves of master devices, such as
      bridge ports, and export only what we're interested in instead of
      having to dump all ports and search for a single one. This will be
      used to export per-port IGMP/MLD stats and also per-port vlan stats
      in the future, possibly other statistics as well. A hedged request
      sketch follows this entry.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      80e73cc5
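      A hedged request sketch (not from the commit): a minimal userspace
      program asking for the per-slave xstats of a single bridge port via
      RTM_GETSTATS. Error handling and parsing of the reply are omitted,
      and the ifindex value is an illustrative assumption.

          #include <string.h>
          #include <unistd.h>
          #include <sys/socket.h>
          #include <linux/netlink.h>
          #include <linux/rtnetlink.h>
          #include <linux/if_link.h>

          int main(void)
          {
              int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
              struct {
                  struct nlmsghdr nlh;
                  struct if_stats_msg ifsm;
              } req;

              memset(&req, 0, sizeof(req));
              req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct if_stats_msg));
              req.nlh.nlmsg_type = RTM_GETSTATS;
              req.nlh.nlmsg_flags = NLM_F_REQUEST;
              req.ifsm.family = AF_UNSPEC;
              req.ifsm.ifindex = 4; /* hypothetical bridge port ifindex */
              /* Request only the per-slave (bridge port) xstats nest. */
              req.ifsm.filter_mask = IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_XSTATS_SLAVE);

              send(fd, &req, req.nlh.nlmsg_len, 0);
              /* recv() and walk the IFLA_STATS_LINK_XSTATS_SLAVE attributes here. */
              close(fd);
              return 0;
          }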
    • bpf: add bpf_skb_change_type helper · d2485c42
      Committed by Daniel Borkmann
      This work adds a helper for changing skb->pkt_type in a controlled way.
      We only allow a subset of possible values and can extend that in the
      future should other use cases come up. Doing this as a helper has the
      advantage that errors can be handled gracefully and the helper thus
      kept extensible.
      
      It's a write counterpart to the pkt_type member we can already read
      from the struct __sk_buff context. The major use case is to change
      incoming skbs to PACKET_HOST in a programmatic way instead of having
      to recirculate via redirect(..., BPF_F_INGRESS), for example. A hedged
      usage sketch follows this entry.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d2485c42
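      A hedged usage sketch (not from the commit), assuming a tc/cls_bpf
      attachment point; the section and function names are illustrative:

          #include <linux/bpf.h>
          #include <linux/if_packet.h> /* PACKET_HOST */
          #include <bpf/bpf_helpers.h>

          SEC("tc")
          int force_pkt_type_host(struct __sk_buff *skb)
          {
              /* Rewrite pkt_type in place instead of recirculating via
               * bpf_redirect(ifindex, BPF_F_INGRESS). Only a subset of
               * values is accepted by the helper. */
              if (skb->pkt_type != PACKET_HOST)
                  bpf_skb_change_type(skb, PACKET_HOST);
              return 0; /* TC_ACT_OK */
          }

          char _license[] SEC("license") = "GPL";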
    • bpf: add bpf_skb_change_proto helper · 6578171a
      Committed by Daniel Borkmann
      This patch adds a minimal helper for doing the groundwork of changing
      the skb->protocol in a controlled way. Currently supported are v4 to
      v6 and vice versa transitions, which allows, for example, for a minimal,
      static nat64 implementation where applications in containers that still
      require IPv4 can be transparently operated in an IPv6-only environment.
      For example, host facing veth of the container can transparently do
      the transitions in a programmatic way with the help of clsact qdisc
      and cls_bpf.
      
      The idea is to separate concerns to keep the complexity of the helper
      lower, which means that the programs utilize bpf_skb_change_proto(),
      bpf_skb_store_bytes() and bpf_lX_csum_replace() to get the job done,
      instead of doing everything in a single helper (and thus partially
      duplicating helper functionality). Also, bpf_skb_change_proto()
      shouldn't need to deal with raw packet data as this is done by other
      helpers.
      
      bpf_skb_proto_6_to_4() and bpf_skb_proto_4_to_6() unclone the skb to
      operate on a private one, push or pop additionally required header
      space and migrate the gso/gro meta data from the shared info. We do
      mark the gso type as dodgy so that headers are checked and segs
      recalculated by the gso/gro engine. The gso_size target is adapted
      as well. The flags argument added is currently reserved and can be
      used for future extensions. A hedged sketch of the intended usage
      follows this entry.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6578171a
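      A hedged sketch of the intended usage (not from the commit): the v4->v6
      direction of a minimal, static nat64-style rewrite from a tc program.
      The addresses and most header fixups are placeholders; as the commit
      notes, a real program would combine bpf_skb_change_proto() with
      bpf_skb_store_bytes() and the checksum helpers to finish the job.

          #include <linux/bpf.h>
          #include <linux/if_ether.h>
          #include <linux/in.h>
          #include <linux/ipv6.h>
          #include <bpf/bpf_helpers.h>
          #include <bpf/bpf_endian.h>

          SEC("tc")
          int nat64_v4_to_v6(struct __sk_buff *skb)
          {
              struct ipv6hdr ip6 = {
                  .version   = 6,
                  .nexthdr   = IPPROTO_TCP, /* placeholder: take from the IPv4 header */
                  .hop_limit = 64,
                  /* .saddr/.daddr: statically mapped prefixes would go here */
              };

              /* Adjust headroom and flip skb->protocol; flags must be 0 (reserved). */
              if (bpf_skb_change_proto(skb, bpf_htons(ETH_P_IPV6), 0) < 0)
                  return 2; /* TC_ACT_SHOT */

              /* Write the new network header; L4 checksum fixups would follow. */
              bpf_skb_store_bytes(skb, ETH_HLEN, &ip6, sizeof(ip6), 0);
              return 0; /* TC_ACT_OK */
          }

          char _license[] SEC("license") = "GPL";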
    • bpf: don't use raw processor id in generic helper · 80b48c44
      Committed by Daniel Borkmann
      Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
      instead of the raw variant. This allows for preemption checks when we
      have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only
      need to keep the raw variant for socket filters, but we can reuse the
      helper that is already there from cBPF side.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      80b48c44
  12. 29 Jun 2016, 2 commits
    • neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit() · b560f03d
      Committed by David Barroso
      neigh_xmit() expects to be called inside an RCU-bh read side critical
      section, and while one of its two current callers gets this right, the
      other one doesn't.
      
      More specifically, neigh_xmit() has two callers, mpls_forward() and
      mpls_output(), and while both callers call neigh_xmit() under
      rcu_read_lock(), this provides sufficient protection for neigh_xmit()
      only in the case of mpls_forward(), as that is always called from
      softirq context and therefore doesn't need explicit BH protection,
      while mpls_output() can be called from process context with softirqs
      enabled.
      
      When mpls_output() is called from process context, with softirqs
      enabled, we can be preempted by a softirq at any time, and RCU-bh
      considers the completion of a softirq as signaling the end of any
      pending read-side critical sections, so if we do get a softirq
      while we are in the part of neigh_xmit() that expects to be run inside
      an RCU-bh read side critical section, we can end up with an unexpected
      RCU grace period running right in the middle of that critical section,
      making things go boom.
      
      This patch fixes this impedance mismatch in the callee, by making
      neigh_xmit() always take rcu_read_{,un}lock_bh() around the code that
      expects to be treated as an RCU-bh read side critical section, as this
      seems a safer option than fixing it in the callers. A simplified sketch
      of the pattern follows this entry.
      
      Fixes: 4fd3d7d9 ("neigh: Add helper function neigh_xmit")
      Signed-off-by: David Barroso <dbarroso@fastly.com>
      Signed-off-by: Lennert Buytenhek <lbuytenhek@fastly.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Acked-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b560f03d
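      A simplified sketch of the pattern (kernel-internal pseudo-code; the
      real neigh_xmit() signature and body differ): the callee takes the
      RCU-bh read lock itself, so callers running in process context with
      softirqs enabled are covered as well.

          static int xmit_with_rcu_bh(struct neighbour *n, struct sk_buff *skb)
          {
              int err;

              rcu_read_lock_bh();   /* explicit RCU-bh read-side critical section */
              /* ... table/neighbour dereferences that rely on RCU-bh here ... */
              err = n->output(n, skb);
              rcu_read_unlock_bh();

              return err;
          }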
    • net: the space is required before the open parenthesis '(' · 8a01ed70
      Committed by Wei Tang
      The space is missing before the open parenthesis '(', and this
      introduces extra noise when checking patches in the surrounding code.
      Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a01ed70
  13. 26 Jun 2016, 1 commit
    • net_sched: drop packets after root qdisc lock is released · 520ac30f
      Committed by Eric Dumazet
      Qdisc performance suffers when packets are dropped at enqueue()
      time because drops (kfree_skb()) are done while qdisc lock is held,
      delaying a dequeue() draining the queue.
      
      Nominal throughput can be reduced by 50 % when this happens,
      at a time we would like the dequeue() to proceed as fast as possible.
      
      Even FQ is vulnerable to this problem, while one of FQ goals was
      to provide some flow isolation.
      
      This patch adds a 'struct sk_buff **to_free' parameter to all
      qdisc->enqueue() implementations and to the qdisc_drop() helper.
      A hedged sketch of the new contract follows this entry.
      
      I measured a performance increase of up to 12 %, but this patch
      is a prereq so that future batches in enqueue() can fly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      520ac30f
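      A hedged sketch of the new contract (simplified; not the exact code
      from the commit): enqueue() no longer frees dropped skbs itself, it
      chains them on *to_free via qdisc_drop(), and the caller frees the
      list with kfree_skb_list() once the root qdisc lock has been released.
      The qdisc name and queue limit below are illustrative assumptions.

          /* Inside a qdisc's enqueue() implementation: */
          static int example_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                                     struct sk_buff **to_free)
          {
              if (unlikely(qdisc_qlen(sch) >= EXAMPLE_LIMIT))
                  return qdisc_drop(skb, sch, to_free); /* defers kfree_skb() */

              /* ... actually queue the skb ... */
              return NET_XMIT_SUCCESS;
          }

          /* Caller side, conceptually (cf. __dev_xmit_skb()): */
          static void caller_sketch(struct Qdisc *q, struct sk_buff *skb,
                                    spinlock_t *root_lock)
          {
              struct sk_buff *to_free = NULL;

              spin_lock(root_lock);
              q->enqueue(skb, q, &to_free);
              spin_unlock(root_lock);

              if (unlikely(to_free))
                  kfree_skb_list(to_free); /* drops happen outside the lock */
          }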
  14. 17 Jun 2016, 2 commits
  15. 16 Jun 2016, 2 commits
  16. 13 Jun 2016, 1 commit
  17. 11 Jun 2016, 2 commits
  18. 09 Jun 2016, 4 commits
    • net: Add l3mdev rule · 96c63fa7
      Committed by David Ahern
      Currently, VRFs require 1 oif and 1 iif rule per address family per
      VRF. As the number of VRF devices increases it brings scalability
      issues with the increasing rule list. All of the VRF rules have the
      same format with the exception of the specific table id to direct the
      lookup. Since the table id is available from the oif or iif in the
      loopup, the VRF rules can be consolidated to a single rule that pulls
      the table from the VRF device.
      
      This patch introduces a new rule attribute l3mdev. The l3mdev rule
      means the table id used for the lookup is pulled from the L3 master
      device (e.g., VRF) rather than being statically defined. With the
      l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev
      rule per address family (IPv4 and IPv6).
      
      If an admin wishes to insert higher priority rules for specific VRFs
      those rules will co-exist with the l3mdev rule. This capability means
      current VRF scripts will co-exist with this new simpler implementation.
      
      Currently, the rules list for both ipv4 and ipv6 look like this:
          $ ip  ru ls
          1000:       from all oif vrf1 lookup 1001
          1000:       from all iif vrf1 lookup 1001
          1000:       from all oif vrf2 lookup 1002
          1000:       from all iif vrf2 lookup 1002
          1000:       from all oif vrf3 lookup 1003
          1000:       from all iif vrf3 lookup 1003
          1000:       from all oif vrf4 lookup 1004
          1000:       from all iif vrf4 lookup 1004
          1000:       from all oif vrf5 lookup 1005
          1000:       from all iif vrf5 lookup 1005
          1000:       from all oif vrf6 lookup 1006
          1000:       from all iif vrf6 lookup 1006
          1000:       from all oif vrf7 lookup 1007
          1000:       from all iif vrf7 lookup 1007
          1000:       from all oif vrf8 lookup 1008
          1000:       from all iif vrf8 lookup 1008
          ...
          32765:      from all lookup local
          32766:      from all lookup main
          32767:      from all lookup default
      
      With the l3mdev rule the list is just the following regardless of the
      number of VRFs:
          $ ip ru ls
          1000:       from all lookup [l3mdev table]
          32765:      from all lookup local
          32766:      from all lookup main
          32767:      from all lookup default
      
      (Note: the above pretty print of the rule is based on an iproute2
             prototype. Actual verbiage may change)
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96c63fa7
    • net: sched: fix missing doc annotations · 123b3652
      Committed by Eric Dumazet
      "make htmldocs" complains otherwise:
      
      .//net/core/gen_stats.c:168: warning: No description found for parameter 'running'
      .//include/linux/netdevice.h:1867: warning: No description found for parameter 'qdisc_running_key'
      
      Fixes: f9eb8aea ("net_sched: transform qdisc running bit into a seqcount")
      Fixes: edb09eb1 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      123b3652
    • net_sched: add missing paddattr description · e0d194ad
      Committed by Eric Dumazet
      "make htmldocs" complains otherwise:
      
      .//net/core/gen_stats.c:65: warning: No description found for parameter 'padattr'
      .//net/core/gen_stats.c:101: warning: No description found for parameter 'padattr'
      
      Fixes: 9854518e ("sched: align nlattr properly when needed")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0d194ad
    • net: Reduce queue allocation to one in kdump kernel · 40e4e713
      Committed by Hariprasad Shenai
      When in a kdump kernel, reduce memory usage by using only a single
      Queue Set for multiqueue devices: make netif_get_num_default_rss_queues()
      return one when running in a kdump kernel. A hedged sketch of this
      behaviour follows this entry.
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40e4e713
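      A hedged sketch of this behaviour (kernel-internal, not the exact diff;
      the non-kdump fallback shown is an assumption about the pre-existing
      default):

          #include <linux/crash_dump.h>   /* is_kdump_kernel() */

          int netif_get_num_default_rss_queues(void)
          {
              /* In a kdump (crash capture) kernel, keep memory usage down by
               * advertising a single default RSS queue. */
              if (is_kdump_kernel())
                  return 1;
              return min_t(int, DEFAULT_MAX_NUM_RSS_QUEUES, num_online_cpus());
          }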
  19. 08 Jun 2016, 2 commits
    • net-sysfs: fix missing <linux/of_net.h> · 88832a22
      Committed by Ben Dooks
      The of_find_net_device_by_node() function is declared in
      <linux/of_net.h>, but that header is not included by the .c file
      that implements the function. Fix the following warning by
      including the header:
      
      net/core/net-sysfs.c:1494:19: warning: symbol 'of_find_net_device_by_node' was not declared. Should it be static?
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88832a22
    • net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1
      Committed by Eric Dumazet
      Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
      agent [1] are problematic at scale:
      
      For each qdisc/class found in the dump, we currently lock the root qdisc
      spinlock in order to get stats. Sampling stats every 5 seconds from
      thousands of HTB classes is a challenge when the root qdisc spinlock is
      under high pressure. Not only do the dumps take time, they also slow
      down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.
      
      An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
      that might need the qdisc lock in fq_codel_dump_stats() and
      fq_codel_dump_class_stats().
      
      In v2 of this patch, I now use the Qdisc running seqcount to provide
      consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
      
      I also changed rate estimators to use the same infrastructure
      so that they no longer need to take the root qdisc lock. A simplified
      sketch of the lockless counter read follows this entry.
      
      [1]
      http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Kevin Athey <kda@google.com>
      Cc: Xiaotian Pei <xiaotian@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      edb09eb1
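      A simplified sketch of the lockless counter read (kernel-internal
      pseudo-code, adapted rather than copied from net/core/gen_stats.c):
      the packets/bytes counters are read against the Qdisc running seqcount
      instead of under the root qdisc spinlock, retrying if a writer was
      active during the read.

          static void read_basic_stats(const seqcount_t *running,
                                       const struct gnet_stats_basic_packed *b,
                                       u64 *bytes, u32 *packets)
          {
              unsigned int seq;

              do {
                  seq = read_seqcount_begin(running);
                  *bytes   = b->bytes;
                  *packets = b->packets;
              } while (read_seqcount_retry(running, seq));
          }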